Abstract

We derive and analyze a new, efficient, pool-based active learning algorithm for halfspaces,
called ALuMA. Most previous algorithms show exponential improvement in the label complexity
assuming that the distribution over the instance space is close to uniform. This
assumption rarely holds in practical applications. Instead, we study the label complexity
under a large-margin assumption, a much more realistic condition, as evidenced by the success
of margin-based algorithms such as SVM. Our algorithm is computationally efficient
and comes with formal guarantees on its label complexity. It also naturally extends to the
non-separable case and to non-linear kernels. Experiments illustrate the clear advantage of
ALuMA over other active learning algorithms.



1. Introduction 



We consider the challenge of pool-based active learning (McCallum and Nigam, 1998), in
which a learner receives a pool of unlabeled examples, and can iteratively query a teacher
for the labels of examples from the pool. The goal of the learner is to return a low-error 
prediction rule using a small number of queries. The number of queries used by the learner 
is termed its label complexity. This setting is most useful when unlabeled data is abundant 
but labeling is expensive, a common case in many data-laden applications. 

In some cases, pool-based active learning can provide an exponential improvement in
label complexity over standard 'passive' learning. For instance, suppose the examples are
points in $[0, 1]$ and there exists some unknown threshold such that points below the threshold
are classified as negative and points above the threshold are classified as positive. The
sample complexity of learning a prediction rule with error less than $\epsilon$ using an i.i.d. labeled
sample is $O(1/\epsilon)$. In comparison, an active learner can select its queries to follow a binary
search, and thus can achieve the same accuracy with a label complexity of only $O(\ln(1/\epsilon))$.
This result holds for any distribution over $[0, 1]$.
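The binary-search argument can be illustrated with a minimal sketch for a finite pool. The function name `active_threshold`, the pool, and the hidden threshold value are hypothetical choices made only for this illustration; the sketch assumes the labels are exactly consistent with some threshold.

```python
def active_threshold(pool, oracle):
    """Label every point in `pool` using a binary search over the sorted pool.

    Assumes labels are -1 below an unknown threshold and +1 at or above it.
    """
    xs = sorted(pool)
    queries = 0
    lo, hi = 0, len(xs)  # the index of the first positive point lies in [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2
        queries += 1
        if oracle(xs[mid]) > 0:
            hi = mid        # the first positive point is at mid or earlier
        else:
            lo = mid + 1    # the first positive point is after mid
    labels = {x: (1 if i >= lo else -1) for i, x in enumerate(xs)}
    return labels, queries


pool = [i / 1000 for i in range(1000)]
threshold = 0.3337  # hypothetical target, hidden from the learner
oracle = lambda x: 1 if x >= threshold else -1
labels, queries = active_threshold(pool, oracle)
```

With a pool of 1000 points the loop makes at most 10 label queries, whereas a passive learner would need the labels of a sample whose size grows linearly in $1/\epsilon$.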

The example above is a special and very simple case of the important and highly common
setting of active learning of halfspaces in $\mathbb{R}^d$. Much research has been devoted to this
challenge, and several algorithms have been proposed. For instance, the Query By Committee
(QBC) algorithm (Seung et al., 1992; Freund et al., 1997) and the CAL algorithm
(Cohn et al., 1994) both examine the unlabeled examples sequentially, and maintain a version
space that is shrunk with each received label. QBC requests the label of an example
if two hypotheses randomly sampled from the current version space disagree on its label.
CAL simply requests the label of any example whose label is not determined by the current
version space. Another example is an active variant of the Perceptron algorithm, proposed
in Dasgupta et al. (2005).



If the marginal distribution of the data is uniform over a sphere centered at the origin, 
then all of these algorithms achieve an exponential improvement in the label complexity 
compared to passive learning, similarly to the single-dimensional case of thresholds on the 
line. 

For CAL, a more general result has been shown: a label complexity of $O(\ln(1/\epsilon))$ is
achieved whenever the joint distribution of data and labels has a bounded disagreement
coefficient (Hanneke, 2007, 2009). This holds, for instance, for any "smooth" distribution
(Friedman, 2009). However, the $O$ notation here hides a dependency on the disagreement
coefficient, which for some data distributions can be very large, making this upper bound
similar to the sample complexity required by a passive learner. This makes the exponential
improvement with respect to $\epsilon$ meaningless in these cases.



To illustrate this point, we consider the following example, due to Dasgupta (2006),
showing a simple distribution in $\mathbb{R}^d$ for which no significant improvement over passive
learning can be achieved with any active learning algorithm.



Example 1 Consider a distribution in $\mathbb{R}^d$ for any $d \ge 3$. Suppose that the support of the
distribution is a set of evenly-distributed points on a two-dimensional sphere that does not
circumscribe the origin, as illustrated in the following figure. As can be seen, each point can
be separated from the rest of the points with a halfspace.



In this example, to distinguish between the case in which all points have a negative label
and the case in which one of the points has a positive label while the rest have a negative
label, any active learning algorithm will have to query every point at least once. It follows
that for any $\epsilon > 0$, if the number of points is $1/\epsilon$, then the label complexity to achieve an
error of at most $\epsilon$ is $1/\epsilon$. On the other hand, the sample complexity of passive learning
in this case is of order $\frac{1}{\epsilon}\log\frac{1}{\epsilon}$, hence no active learner can be significantly better than a
passive learner on this distribution.¹ This proves the following claim:



Claim 1 Let $C$ be a universal constant. For all $\epsilon \in (0, \ldots)$ and $d \ge 3$, there exists a
distribution over $\mathbb{R}^d$, such that no active learner can have label complexity smaller than $1/\epsilon$
for learning the hypothesis class of origin-centered halfspaces on this distribution. On the
other hand, the sample complexity of the vanilla ERM passive learner for this distribution
is $\frac{C}{\epsilon}\log\frac{1}{\epsilon}$.

1. The disagreement coefficient for this example is equal to the number of points. In addition, while the
distribution in this example is not smooth, and thus the results of Friedman (2009) do not hold here, we
can easily slightly change the distribution to make it smooth, while still having the same phenomenon.
This does not contradict the result of Friedman (2009), as the exponential improvement from $1/\epsilon$ to
$\ln(1/\epsilon)$ kicks in only when $1/\epsilon$ is larger than the number of points.

We see that there cannot be a true exponential improvement with respect to $\epsilon$ which
is uniform over all distributions and target hypotheses. Moreover, the only known case
where a non-trivial label complexity bound can be achieved for active learning is when
the distribution is uniform (or nearly uniform) over a sphere centered at the origin. This
is a serious limitation, since real applications rarely exhibit anything similar to a uniform
distribution of their data.

A second limitation of the result for CAL is that despite the logarithmic dependence
on $1/\epsilon$, the absolute label complexity of CAL can be much worse than that of the optimal
algorithm. This is illustrated in the following example and the theorem following it.

Example 2 Consider a distribution in $\mathbb{R}^d$ that is supported by two types of points on an
octahedron (see an illustration for $d = 3$ below).

1. Vertices: $\{e_1, \ldots, e_d\}$.

2. Face centers: $z/d$ for $z \in \{-1, +1\}^d$.

Consider the hypothesis class $\mathcal{W} = \{x \mapsto \mathrm{sgn}(\langle x, w\rangle - 1 + \frac{1}{d}) \mid w \in \{-1, +1\}^d\}$. Each
hypothesis in $\mathcal{W}$, defined by some $w \in \{-1, +1\}^d$, classifies at most $d+1$ data points as
positive: these are the vertices $e_i$ for $i$ such that $w[i] = +1$, and the face center $w/d$.



Theorem 1 Consider Example 2 for $d \ge 3$, and assume that the pool of examples includes
the entire support of the distribution. There is an efficient algorithm that finds the correct
hypothesis from $\mathcal{W}$ with at most $d$ labels. On the other hand, with probability at least $\frac{1}{e}$ over
the randomization of the sample, CAL uses at least $\frac{2^d+d}{2d+3}$ labels to find the correct separator.

Proof First, it is easy to see that if $h^* \in \mathcal{W}$ is the correct hypothesis, then

$w = (h^*(e_1), \ldots, h^*(e_d))$.

Thus, it suffices to query the $d$ vertices to discover the true $w$.

We now show that the number of queries CAL asks until finding the correct separator
is exponential in $d$. Consider some run of CAL (determined by the random ordering of the
sample). Assume w.l.o.g. that each data point appears once in the sample. Let $S$ be the set
that includes the positive face center and all the vertices. Note that CAL cannot terminate
before either querying all the $2^d - 1$ negative face centers, or querying at least one example
from $S$. Moreover, CAL will query all the face centers it encounters before encountering
the first example from $S$. At each iteration $t$ before encountering an example from $S$, there
is a probability of $\frac{d+1}{2^d+d-t}$ that the next example is from $S$. Therefore, the probability that
the first $T = \frac{2^d+d}{2d+3}$ examples are not from $S$ is

$\prod_{t=0}^{T-1}\left(1 - \frac{d+1}{2^d+d-t}\right) \ge \left(1 - \frac{d+1}{2^d+d-T}\right)^T \ge e^{-\frac{2(d+1)T}{2^d+d-T}} = e^{-1} = \frac{1}{e},$

where in the second inequality we used $1 - a \ge \exp(-2a)$, which holds for all $a \in [0, \frac{1}{2}]$.
Therefore, with probability at least $\frac{1}{e}$ the number of queries is at least $T$. ■
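The two halves of the argument can be checked numerically on a small instance. The sketch below builds the pool of Example 2 for $d = 4$, recovers $w$ from the $d$ vertex labels, and evaluates the exponent that appears in the lower bound for CAL. The helper names, the target `w_true`, and the offset $-1 + 1/d$ in the hypothesis are assumptions of this sketch (the offset is reconstructed from the requirement that each hypothesis labels only the face center $w/d$ and the vertices with $w[i] = +1$ as positive).

```python
import itertools


def example2_pool(d):
    """Vertices e_1..e_d and face centers z/d for z in {-1,+1}^d."""
    vertices = [tuple(1.0 if j == i else 0.0 for j in range(d)) for i in range(d)]
    centers = [tuple(s / d for s in z) for z in itertools.product((-1, 1), repeat=d)]
    return vertices, centers


def h(w, x, d):
    """Hypothesis x -> sgn(<x, w> - 1 + 1/d); the offset is an assumption of this sketch."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) - 1 + 1 / d > 0 else -1


def learn_with_d_queries(d, oracle):
    vertices, centers = example2_pool(d)
    w = tuple(oracle(v) for v in vertices)        # exactly d label queries
    return w, {x: h(w, x, d) for x in vertices + centers}


d = 4
w_true = (1, -1, -1, 1)                           # hypothetical target
oracle = lambda x: h(w_true, x, d)
w_hat, labels = learn_with_d_queries(d, oracle)

# With T = (2^d + d) / (2d + 3), the exponent 2(d+1)T / (2^d + d - T) equals 1.
T = (2 ** d + d) / (2 * d + 3)
exponent = 2 * (d + 1) * T / (2 ** d + d - T)
```

Querying the $d$ vertices labels all $2^d + d$ points, while the value of `exponent` confirms that this choice of $T$ yields the probability $e^{-1}$ in the bound for CAL.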

In this paper we present a new efficient pool-based active learning algorithm for learning
high dimensional halfspaces. As Example 1 above shows, it is not possible to guarantee a
significant improvement in label complexity that holds uniformly for all distributions. However,
for any given pool of unlabeled examples, there exists some optimal label complexity:
the one that would be achieved by an optimal active learning algorithm for this pool. By
"optimal" we mean an algorithm which queries the minimal number of labels in the worst-case
sense, where worst-case here is with respect to the target hypothesis. Unfortunately,
it is unknown whether there exists a polynomial-time optimal algorithm for the class of
halfspaces. We therefore present an efficient approximation algorithm. That is, we
analyze the label complexity of our algorithm with a regret-type analysis: we bound the label
complexity of our algorithm relative to the optimal label complexity that can be achieved
for the given unlabeled pool. Therefore, if for a given pool of examples active learning
can outperform passive learning, then our algorithm will enjoy a similar improvement. On
the other hand, if the pool is inherently "difficult", as in Example 1, our algorithm might
require the same number of labels as a passive learner.

As in QBC and CAL, we maintain a version space and update it whenever we receive
a label. Unlike QBC and CAL, we do not restrict ourselves to examining the unlabeled
examples sequentially. Instead, we examine the entire pool at each iteration and select
the next example to query in a greedy manner: we select an example that approximately
maximizes the worst-case reduction in the version space.²

Deriving active learning algorithms by greedily maximizing reductions in the version
space has been recently proposed by Golovin and Krause (2010). In particular, they proposed
the notion of adaptive submodular optimization, and showed that in the case of
a finite hypothesis class, a greedy selection rule can be competitive with the optimal algorithm,
where the competitiveness factor depends on the logarithm of the size of the
hypothesis class. However, in the case of halfspaces the hypothesis class is of infinite size,
hence this analysis is not directly applicable.³ Furthermore, a generic implementation of
their algorithm yields a runtime that depends linearly on the size of the hypothesis class.

To tackle these issues, we will rely on the familiar notion of margin. We will assume that
the negative and positive examples can be separated (or approximately separated) with a
positive margin. Under this assumption, we are able to derive an efficient algorithm with a
label-complexity competitiveness guarantee that depends on the margin. A larger margin
will result in better regret guarantees for our algorithm. Some of the most popular learning
algorithms (e.g. SVM, Perceptron, and AdaBoost) rely on a large margin assumption, and
their practical success indicates that margin is a reasonable requirement for many real world
problems.

2. The efficient algorithm shown in the proof of Theorem 1 is in fact an instance of this strategy.

3. By Sauer's lemma, the effective size of the hypothesis class of halfspaces on a set of $m$ unlabeled examples
is only $m^d$. One can thus apply the algorithm of Golovin and Krause (2010) to this finite hypothesis
class; however, the runtime of the resulting algorithm will be exponential in $d$.

The margin assumption is beneficial in two additional aspects. First, our algorithm
can handle high-dimensional data by performing a pre-processing step of dimensionality
reduction. In particular, this can be performed when the data is represented by a kernel
matrix. Kernels have had tremendous impact on machine learning theory and algorithms
over the past decade (Cristianini and Shawe-Taylor, 2004; Schölkopf and Smola, 2002), so
the ability to apply our algorithm with kernels is a clear advantage. We note that an efficient
implementation of QBC with kernels has been proposed in Gilad-Bachrach et al. (2005),
but no formal label complexity guarantees have been provided for the resulting algorithm.

Second, our algorithm can handle samples with a small label error via a simple transformation
that maps any labeled sample to a sample which is separable with a margin.
This margin depends on the optimal hinge-loss of the sample. This transformation can be
applied for any algorithm, including CAL or QBC. However, the transformation leads to a
data distribution which might be far from uniform, hence the label complexity guarantees
of these algorithms do not apply. In contrast, since our analysis only assumes margin, the
theoretical guarantees still hold for our algorithm.

We note that several algorithms have been proposed for active learning in the agnostic
case, for instance that of Balcan et al. (2006a) and IWAL (Beygelzimer et al., 2009). These
algorithms provide similar exponential speedups for the uniform distribution without requiring
separability. However, unlike the algorithms for the separable case, these algorithms
are computationally intractable when learning the class of halfspaces with the zero-one loss.

To summarize, the proposed algorithm is computationally efficient, tolerant to some label
error, applicable for learning in high dimensional spaces, and comes equipped with label
complexity guarantees that hold for any distribution in $\mathbb{R}^d$ that (approximately) satisfies a
margin assumption.



Tong and Koller (2002) also addressed active learning of halfspaces under the assumption
of separability with a margin. They proposed the principle of choosing the example that
splits the version space as evenly as possible. They show that if at every round the chosen
example splits the version space exactly in half, then the algorithm achieves the minimal
possible label complexity (in a certain Bayesian sense). They implement several heuristics
that attempt to follow this selection principle using an efficient algorithm. For instance,
they suggest choosing the vector $x$ which is closest to the max-margin solution of the data
labeled so far. However, no formal guarantees are provided for these heuristics relative to
the proposed principle. Moreover, for each of these heuristics there are cases where the split
induced by the example selected by the heuristic is much less balanced than the one induced
by the most balanced example. In Balcan et al. (2007), an active learning algorithm with
guarantees under a margin assumption is proposed. However, these guarantees hold only if
the data distribution is uniform over an ellipsoid.

One may wonder if a large margin assumption alone can guarantee a uniform improvement
in label complexity. This is not the case, as evident by the following example.

Example 3 Let $\gamma \in (0, \frac{1}{2})$ be a margin parameter. Consider the uniform distribution
supported by $N$ points in $\mathbb{R}^d$, such that all the points are on the unit sphere, and for each
pair of points $x_1$ and $x_2$ in the support, $\langle x_1, x_2\rangle < 1 - 2\gamma$. It was shown in Shannon (1959)
that for any $N < O((1/\gamma)^d)$, there exists a set of points that satisfy the conditions above.
Now, for any point $x$ in the support, there exists a (biased) linear separator that separates
$x$ from the rest of the points with a margin of $\gamma$. This can be seen by letting $w = x$ and
$b = 1 - \gamma$. Then $\langle w, x\rangle - b = \gamma$ while for any $z \ne x$ in the set, $\langle w, z\rangle - b = \langle x, z\rangle - 1 + \gamma < -\gamma$.
By adding a single dimension, this example can be transformed to one with origin-centered
halfspaces.
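The separator $w = x$, $b = 1 - \gamma$ can be verified on a concrete point set. The sketch below uses the standard basis of $\mathbb{R}^5$ as a hypothetical support (unit vectors with pairwise inner product $0$, which satisfies the condition for $\gamma = 0.25$); the function name is an assumption of this illustration.

```python
def isolating_margins(points, k, gamma):
    """Values <w, z> - b for every z in points, using w = points[k] and b = 1 - gamma."""
    w, b = points[k], 1.0 - gamma
    return [sum(wi * zi for wi, zi in zip(w, z)) - b for z in points]


gamma = 0.25
# Hypothetical support: the standard basis of R^5.
points = [[1.0 if j == i else 0.0 for j in range(5)] for i in range(5)]
margins = isolating_margins(points, 2, gamma)
```

The selected point sits at distance $\gamma$ on the positive side, and every other point lies at least $\gamma$ on the negative side, so each point can be singled out with margin $\gamma$.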

Just as in Example 1, here too any active learner must query all $N$ points in the worst case.
We have thus proved:

Claim 2 Let $C$ be a universal constant. For all $d \ge 2$, $\gamma \in (0, \ldots)$, $\ldots$ and
$\epsilon \in (\ldots)$, there exists a distribution over $\mathbb{R}^d$, such that no active learning algorithm can
have label complexity smaller than $1/\epsilon$ for learning the hypothesis class of origin-centered
halfspaces on this distribution, even if the learner knows that the data is separable with a
margin of $\gamma$. On the other hand, the sample complexity of the vanilla ERM passive learner
for this distribution is $\ldots$

We provide a formal problem statement in Section 2 and state our main results in
Section 3. Section 4 describes the ALuMA algorithm, and Section 5 outlines the proof
of the properties of ALuMA. Section 6 provides the transformation that allows ALuMA
to handle kernel representations and labeling errors. In Section 7 we describe a simpler
implementation of ALuMA, which is useful for running the algorithm in practice. We
report experiments on real and synthetic datasets in Section 8. The detailed proofs of our
main results are given in Appendix A and Appendix B.



2. Formal Problem Statement 

In pool-based active learning, the learner receives as input a set of instances, denoted
$X = \{x_1, \ldots, x_m\}$. Each instance is associated with a label (which is initially unknown to
the learner). The goal of the active learner is to label all $x \in X$ correctly using as few label
queries as possible. If $X$ is a sample drawn i.i.d. from a fixed distribution, then the labels
the active learning algorithm outputs can then be used to train a classifier with low error
on the distribution, using any passive learning algorithm that receives the labeled sample
as input. Therefore, from now on we focus on the problem of determining the labels of the
examples in $X$.

The learner has access to a teacher, represented by an oracle $L : [m] \to \{-1, 1\}$. The
goal of the learner is to find the values $L(1), \ldots, L(m)$ using as few calls to $L$ as possible.

We study active learning of the hypothesis class of halfspaces. Let $\mathbb{B}_1^d$ be the unit
ball in $\mathbb{R}^d$. We assume that $X \subseteq \mathbb{B}_1^d$, and that there exists some $w^* \in \mathbb{B}_1^d$ such that
$L(i) = \mathrm{sgn}(\langle w^*, x_i\rangle)$ for all $i \in [m]$. The label complexity of an active learning algorithm
is the maximal number of calls to $L$ that it makes before determining the labels of all
instances in $X$, where the maximum is over all the possible functions $L$ determined by
some $w^* \in \mathbb{B}_1^d$. Formally, an active learning algorithm $A$ can call $L$ several times, and then
should output $(L(1), \ldots, L(m))$. Let $N(A, w^*)$ be the number of calls to $L$ that $A$ makes
before outputting $(L(1), \ldots, L(m))$ if $L$ is determined by $w^*$. The (worst-case) cost of $A$
is defined as $c_{wc}(A) = \max_{w^* \in \mathbb{B}_1^d} N(A, w^*)$. We define the worst-case cost of the optimal
algorithm by

$\mathrm{OPT} = \min_A c_{wc}(A)$, (1)

where the minimum is over all active learning algorithms.
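For a very small pool, the quantity in Equation (1) can be computed by exhaustive search over query policies. The sketch below is only an illustration under the assumption that the set of labelings of the pool realizable by the hypothesis class is given explicitly as a list of tuples; the function `opt_cost` and the two toy classes are hypothetical, and the search takes exponential time.

```python
from functools import lru_cache


def opt_cost(labelings):
    """Worst-case number of queries an optimal policy needs to pin down the labeling.

    `labelings` holds tuples in {-1,+1}^m: the labelings of the pool that the
    hypothesis class can realize. Exponential time; intended for tiny pools only.
    """
    m = len(next(iter(labelings)))

    @lru_cache(maxsize=None)
    def cost(consistent):
        if len(consistent) <= 1:
            return 0  # the labeling is already determined
        best = m
        for i in range(m):
            pos = frozenset(l for l in consistent if l[i] == 1)
            neg = consistent - pos
            if not pos or not neg:
                continue  # the label of x_i is already implied, so querying it is useless
            best = min(best, 1 + max(cost(pos), cost(neg)))
        return best

    return cost(frozenset(labelings))


m = 7
# Thresholds on a line: the first k points are negative, the rest positive.
thresholds = [tuple(-1 if j < k else 1 for j in range(m)) for k in range(m + 1)]
# All points negative, or exactly one positive (as in Example 1).
singletons = [tuple(-1 for _ in range(m))] + [
    tuple(1 if j == k else -1 for j in range(m)) for k in range(m)
]
opt_thresholds = opt_cost(thresholds)
opt_singletons = opt_cost(singletons)
```

On these toy classes the search returns 3 for thresholds on 7 points and 7 for the singleton class, matching the contrast between the binary-search setting and the setting of Example 1.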

As mentioned in the introduction, our label complexity guarantees will be relative to
the optimal label complexity that can be achieved for the given sample $X$. We refer to OPT
as a measure of the difficulty of the active learning problem and expect our algorithm to
succeed if OPT is "small". When analyzing the label complexity of our algorithm we make
two relaxations. First, we further assume that $(X, L)$ is separated with a margin $\gamma$, that is,
there exists a $w^*$ such that $\min_{i \in [m]} L(i)\langle w^*, x_i\rangle \ge \gamma$. The label complexity guarantees
of our algorithm depend on $\gamma$. We describe a relaxation of this assumption of separability
with a margin in Section 3.

Second, we allow our algorithm to make randomized decisions, and require it to output
$(y_1, \ldots, y_m)$ such that with high probability $y_i = L(i)$ for all $i$. That is, the algorithm
is allowed to fail with small probability. We will show that the number of calls to $L$ our
algorithm makes is not much larger than OPT.⁴



3. Main results 

It is easy to see that the problem of finding a policy that implements OPT for a general
hypothesis class is at least as hard as the problem of finding a decision tree of minimal height.
Unfortunately, this problem is NP-hard in the general case (Arkin et al., 1993). Using the
additional assumption of separability with a margin, we provide an efficient algorithm that
finds the correct labeling of the sample with high probability, using a number of queries
which is comparable to OPT. Specifically, we prove the following theorem:

Theorem 2 Let $X = \{x_1, \ldots, x_m\} \subseteq \mathbb{B}_1^d$. Assume that $(X, L)$ is separable with a margin $\gamma$.
Let $\delta \in (0, 1)$ be a confidence parameter. There exists a pool-based active learning algorithm
with the following guarantees:

1. With probability at least $1 - \delta$, it returns $L(1), \ldots, L(m)$.

2. It requests at most $\mathrm{OPT} \cdot 4(2d\ln(2/\gamma) + \ln(2))$ labels, where OPT is defined in Equation (1).

3. Its running time is polynomial in $m$, $d$ and $\ln(1/\delta)$.

The guarantees of the theorem above depend linearly on $d$, the dimension of the representation
of the examples. This may be an issue in very high dimensions, and is prohibitive
in the case of a kernel representation of $X$. In addition, this theorem holds only for fully
separable data. These limitations can be circumvented by applying a fairly simple transformation
on the points in $X$ as a preprocessing step. This transformation maps the points of
$X$ to a set of points in a lower dimension, such that the new set is separable with respect to
the same oracle $L$. The input points $X$ can be represented either directly as points in $\mathbb{R}^d$, or
via a kernel matrix $k(x, x')$ for $x, x' \in X$. The dimension of the new representation depends
on the hinge-loss of $\{(x_1, L(1)), \ldots, (x_m, L(m))\}$ with respect to the margin. Our algorithm
can then be applied to the low-dimension points to retrieve the labels of $X$, with time complexity
and label complexity approximation guarantees that do not depend on the original
dimension $d$. As a final step, the labeled sample $\{(x_1, L(1)), \ldots, (x_m, L(m))\}$ can be used
as input to a passive learner. Note that in this scheme the generalization bounds depend
only on the properties of $X$ and not on the properties of the low-dimensional mapping. The
following theorem formalizes the properties of the transformation.

4. Note that the requirements of our algorithm are easier than those of the optimal algorithm via the
definition of OPT, since in the definition of $c_{wc}$ (and therefore of OPT) we do not maximize only over
those $w^*$ that achieve margin $\gamma$, and we do not allow $A$ to fail with small probability. It is easy to derive
cases in which OPT is large, but under the margin assumption there exists an algorithm that makes few
calls to $L$. Nevertheless, OPT is a reasonable measure of the "difficulty" of the active learning problem.
We leave further research on relaxed definitions of OPT to future work.

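Theorem 3 Let $X = \{x_1, \ldots, x_m\} \subseteq \mathbb{B}$, where $\mathbb{B}$ is the unit ball in some Hilbert space.
Let $H \ge 0$ and $\gamma > 0$, and assume there exists a $w^* \in \mathbb{B}$ such that

$H \ge \sum_{i=1}^{m} \max\{0, \gamma - L(i)\langle w^*, x_i\rangle\}^2.$

Let $\delta \in (0, 1)$ be a confidence parameter. There exists an algorithm that receives $X$ as vectors
in $\mathbb{R}^d$ or as a kernel matrix $K \in \mathbb{R}^{m \times m}$, and input parameters $\gamma$ and $H$, and outputs a set
$\bar{X} = \{\bar{x}_1, \ldots, \bar{x}_m\} \subseteq \mathbb{R}^k$, such that

1. $k = O\left(\frac{(H+1)\ln(m/\delta)}{\gamma^2}\right)$.

2. With probability $1 - \delta$, $\bar{X} \subseteq \mathbb{B}_1^k$ and $(\bar{X}, L)$ is separable with a margin $\frac{\gamma}{2+2\sqrt{H}}$.

3. The run-time of the algorithm is polynomial in $d$, $m$, $1/\gamma$, $\ln(1/\delta)$ if the $x_i$ are represented
as vectors in $\mathbb{R}^d$, and is polynomial in $m$, $1/\gamma$, $\ln(1/\delta)$ if the $x_i$ are represented by a kernel
matrix.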



Finally, consider the common learning setting in which there is some distribution $D$ over
(instance, label) pairs, and $\{(x_1, L(1)), \ldots, (x_m, L(m))\}$ are i.i.d. pairs drawn from $D$. If $D$
is separable with a margin $\gamma$, then it suffices to have a labeled sample of size $m = O(\ldots)$
to allow passive learning to accuracy $\epsilon$. The transformation above thus maps the sample
to dimension $O(\ldots)$. Executing our algorithm on the resulting sample we get an
active learning algorithm that runs in time polynomial in $1/\epsilon$, $1/\gamma$ and $\ln(1/\delta)$, and has a label
complexity which is guaranteed to be no more than $\mathrm{OPT} \cdot O(\ldots)$, where OPT is
the number of queries required by an optimal active learning algorithm on the transformed
sample.



4. The ALuMA algorithm 

In this section we describe our algorithm. We name it Active Learning under a Margin
Assumption, or ALuMA for short.

To explain our approach, it is convenient to think about the active learning process as
a search procedure for the vector $w^*$ that determines the labels $L(1), \ldots, L(m)$. Suppose our
first $t$ calls to $L$ are for the labels $L(i_1), \ldots, L(i_t)$, and denote $P_t = \{(x_{i_1}, L(i_1)), \ldots, (x_{i_t}, L(i_t))\}$.
Then, we know that $w^*$ must be in the set

$V(P_t) = \{w \in \mathbb{B}_1^d : \forall (x, y) \in P_t,\ y\langle w, x\rangle > 0\}.$

The set $V(P_t)$ is called the version space induced by $P_t$.

Intuitively, we would like to query labels that will make $V(P_t)$ "small". There are many
ways to define the "size" of $V(P_t)$. One way, which is convenient for our analysis, is to
define the size of a version space by its volume, denoted $\mathrm{Vol}(V(P_t))$. Therefore, ideally, we
would like to query the labels $L(i_1), \ldots, L(i_t)$ for which $\mathrm{Vol}(V(P_t))$ is minimal. Naturally,
the size of $V(P_t)$ depends on the actual labels we will receive from $L$, which are unknown to
us prior to querying them. We therefore would like to query the labels for which $\mathrm{Vol}(V(P_t))$
is minimal in the worst-case, where the worst-case is over all possible vectors $w^*$ that may
determine $L(1), \ldots, L(m)$.

The number of all possible sequences of $t$ queries grows exponentially with $t$, which poses
a computational challenge. We thus follow a simple greedy approach: at each iteration,
query the example whose label will lead to a version space of minimal size. Again, since we
do not know the label prior to querying, we would like to choose the example which leads
to the minimal size in the worst-case, where the worst-case is over all possible vectors $w$ in
the current version space. Formally, given the current version space $V(P_t)$, the next query
should be for the label of the example in

$\operatorname{argmin}_{x \in X} \max_{w \in V(P_t)} \mathrm{Vol}(V(P_t \cup \{(x, \mathrm{sgn}(\langle w, x\rangle))\}))$. (2)

Denoting $V_{t,x}^{1} = V(P_t \cup \{(x, 1)\})$ and $V_{t,x}^{-1} = V(P_t \cup \{(x, -1)\})$, an equivalent way to write
Equation (2) is⁵

$\operatorname{argmax}_{x \in X} \mathrm{Vol}(V_{t,x}^{1}) \cdot \mathrm{Vol}(V_{t,x}^{-1})$. (3)
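The greedy rule of Equation (3) can be sketched with a naive volume estimate. The code below is not the estimator used by ALuMA: it swaps in plain rejection sampling from the unit ball, which is only practical in low dimension and while the version space is not too small. All function names, the sample size, and the toy pool are assumptions of this illustration.

```python
import math
import random


def sample_unit_ball(d, rng):
    """Uniform point in the unit ball: Gaussian direction, radius U^(1/d)."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(v * v for v in g))
    r = rng.random() ** (1.0 / d)
    return [r * v / norm for v in g]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def greedy_query(pool, labeled, n_samples=2000, seed=0):
    """Index of the pool point maximizing the estimated Vol(V+) * Vol(V-).

    `labeled` is a list of (x, y) pairs that defines the current version space.
    Rejection sampling stands in for a proper volume estimator (an assumption).
    """
    rng = random.Random(seed)
    d = len(pool[0])
    ws = []
    while len(ws) < n_samples:
        w = sample_unit_ball(d, rng)
        if all(y * dot(w, x) > 0 for x, y in labeled):
            ws.append(w)

    def score(x):
        p = sum(dot(w, x) > 0 for w in ws) / len(ws)  # estimates Vol(V+) / Vol(V)
        return p * (1.0 - p)

    return max(range(len(pool)), key=lambda i: score(pool[i]))


pool = [(1.0, 0.0),
        (math.cos(math.pi / 6), math.sin(math.pi / 6)),
        (0.0, 1.0)]
labeled = [((1.0, 0.0), 1)]          # the first point was already labeled positive
chosen = greedy_query(pool, labeled)
```

In this toy pool the version space is a half disk; the point $(0, 1)$ cuts it into two equal quarters, so it is the one selected, while the already-labeled point has score zero.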



To implement Equation (3), we need to be able to calculate the volumes of the sets $V_{t,x}^{1}$
and $V_{t,x}^{-1}$. Both of these sets are convex sets obtained by intersecting the unit ball with
halfspaces. The problem of calculating the volume of such convex sets in $\mathbb{R}^d$ is #P-hard if $d$
is not fixed (Brightwell and Winkler, 1991). Moreover, deterministically approximating the
volume is NP-hard in the general case (Matoušek, 2002). Luckily, it is possible to approximate
this volume using randomization. Specifically, in Kannan et al. (1997) a randomized
algorithm with the following guarantees is provided:

Lemma 4 Let $K \subseteq \mathbb{R}^d$ be a convex body with an efficient separation oracle. There exists a
randomized algorithm, such that given $\epsilon, \delta > 0$, with probability at least $1 - \delta$ the algorithm
returns a non-negative number $\Gamma$ such that $(1 - \epsilon)\Gamma \le \mathrm{Vol}(K) \le (1 + \epsilon)\Gamma$. The running time
of the algorithm is polynomial in $d$, $1/\epsilon$, $\ln(1/\delta)$.

5. To see the equivalence, denote $\alpha(x) = \max_{w \in V(P_t)} \mathrm{Vol}(V_{t,x}^{\mathrm{sgn}(\langle w, x\rangle)})/\mathrm{Vol}(V(P_t))$. Clearly, Equation (2)
can be written as $\operatorname{argmin}_x \alpha(x)$. Note that $\alpha(x) \in [1/2, 1]$ and $1 - \alpha(x) = \mathrm{Vol}(V_{t,x}^{-\mathrm{sgn}(\langle w, x\rangle)})/\mathrm{Vol}(V(P_t))$
for the maximizing $w$. Therefore, Equation (3) can be written as

$\operatorname{argmax}_x \alpha(x)(1 - \alpha(x)) = \operatorname{argmax}_x \min\{\alpha(x), 1 - \alpha(x)\} = \operatorname{argmin}_x \max\{1 - \alpha(x), \alpha(x)\}.$

Since $\alpha(x) \in [1/2, 1]$ we conclude that the above equals $\operatorname{argmin}_x \alpha(x)$ as required.

ALuMA uses this algorithm to estimate $\mathrm{Vol}(V_{t,x}^{1})$ and $\mathrm{Vol}(V_{t,x}^{-1})$ with sufficient accuracy.
We denote an execution of this algorithm on a convex body $K$ by $\Gamma \leftarrow \mathrm{VolEst}(K, \epsilon, \delta)$. The
convex body $K$ is represented in the algorithm by the set of the constraints that define it.

ALuMA terminates when it exhausts its budget of queries, which is provided as a parameter
to the algorithm. If the final version space $V$ does not determine the labeling of $X$,
ALuMA randomly draws several hypotheses from $V$ and labels $X$ according to a majority
vote over these hypotheses. Our analysis will show that if the budget is large enough and
the draw of the hypotheses is approximately uniform from $V$, then this strategy leads to
the correct labeling of $X$ with high probability.

To draw a hypothesis approximately uniformly from $V$, we use the hit-and-run algorithm
(Lovász, 1999), which draws a random sample from a convex body $K$ according to
a distribution which is close in total variation distance to the uniform distribution over
$K$. Formally, the following definition parametrizes the closeness of a distribution to the
uniform distribution:

Definition 5 Let $K \subseteq \mathbb{R}^d$ be a convex body with an efficient separation oracle, and let $\tau$ be
a distribution over $K$. $\tau$ is $\lambda$-uniform if $\sup_A |\tau(A) - \mathrm{Vol}(A)/\mathrm{Vol}(K)| \le \lambda$, where the supremum
is over all measurable subsets of $K$.

The hit-and-run algorithm draws a sample from a $\lambda$-uniform distribution in time polynomial in $d$ and $1/\lambda$.
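A hit-and-run walk over a version space (the unit ball intersected with halfspaces through the origin) can be sketched as follows. This is an illustrative implementation of the basic walk only: the fixed number of steps is an arbitrary assumption and carries no $\lambda$-uniformity guarantee, and the function names, constraints, and starting point are hypothetical.

```python
import math
import random


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def hit_and_run(start, constraints, n_steps, rng):
    """Random walk inside {w : ||w|| <= 1 and y * <w, x> >= 0 for (x, y) in constraints}.

    `start` must be a point of the body. Returns the point reached after n_steps.
    """
    w = list(start)
    d = len(w)
    for _ in range(n_steps):
        g = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(dot(g, g))
        u = [v / norm for v in g]                  # uniform random direction
        # Chord of the unit ball along w + t * u: solve t^2 + 2bt + c <= 0.
        b, c = dot(w, u), dot(w, w) - 1.0
        root = math.sqrt(max(b * b - c, 0.0))
        lo, hi = -b - root, -b + root
        # Clip the chord by every halfspace constraint.
        for x, y in constraints:
            a, s = y * dot(w, x), y * dot(u, x)
            if s > 0:
                lo = max(lo, -a / s)
            elif s < 0:
                hi = min(hi, -a / s)
        t = rng.uniform(lo, hi)                    # uniform point on the clipped chord
        w = [wi + t * ui for wi, ui in zip(w, u)]
    return w


rng = random.Random(0)
constraints = [([1.0, 0.0, 0.0], 1), ([0.0, 1.0, 0.0], -1)]
start = [0.2, -0.2, 0.0]
samples = [hit_and_run(start, constraints, 50, rng) for _ in range(200)]
```

Each step only needs the constraints that define the body, which matches the representation of the version space as a list of labeled examples.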

ALuMA is listed below as Alg. 1. Its inputs are the unlabeled sample $X$, the labeling
oracle $L$, the maximal allowed number of label queries $N$, and the desired confidence $\delta \in
(0, 1)$. It returns the labels of all the examples in $X$.



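Algorithm 1 The ALuMA algorithm
1: Input: $X = \{x_1, \ldots, x_m\}$, $L : [m] \to \{-1, 1\}$, $N$, $\delta$
2: $I_1 \leftarrow [m]$
3: $V_1 \leftarrow \mathbb{B}_1^d$
4: for $t = 1$ to $N$ do
5:    $\forall i \in I_t, j \in \{\pm 1\}$, do $v_{x_i,j} \leftarrow \mathrm{VolEst}(V_{t,x_i}^{j}, \ldots)$
6:    Select $i_t \in \operatorname{argmax}_{i \in I_t} (v_{x_i,1} \cdot v_{x_i,-1})$
7:    $I_{t+1} \leftarrow I_t \setminus \{i_t\}$
8:    Request $y = L(i_t)$
9:    $V_{t+1} \leftarrow V_t \cap \{w : y\langle w, x_{i_t}\rangle > 0\}$
10: end for
11: $M \leftarrow \lceil 72\ln(2/\delta)\rceil$.
12: Draw $w_1, \ldots, w_M$ $\ldots$-uniformly from $V_{N+1}$.
13: For each $x_i$ return the label $y_i = \mathrm{sgn}\left(\sum_{j=1}^{M} \mathrm{sgn}(\langle w_j, x_i\rangle)\right)$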
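The overall control flow of the listing (greedy selection, version-space update, majority vote) can be sketched end to end on a toy problem. This is a simplified stand-in and not the algorithm analyzed here: rejection sampling from the unit ball replaces both VolEst and hit-and-run, so it is only usable in low dimension. The function names, the parameter `n_vol`, and the two-dimensional pool are assumptions of this illustration.

```python
import math
import random


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def sample_version_space(labeled, d, n, rng):
    """Draw n points uniformly from the unit ball, keeping those consistent with `labeled`."""
    kept = []
    while len(kept) < n:
        g = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(dot(g, g))
        radius = rng.random() ** (1.0 / d)
        w = [radius * v / norm for v in g]
        if all(y * dot(w, x) > 0 for x, y in labeled):
            kept.append(w)
    return kept


def aluma_sketch(pool, oracle, budget, delta, n_vol=500, seed=0):
    rng = random.Random(seed)
    d = len(pool[0])
    remaining = set(range(len(pool)))               # I_t
    labeled = []                                    # constraints that define V_t
    for _ in range(min(budget, len(pool))):
        ws = sample_version_space(labeled, d, n_vol, rng)

        def balance(i):
            p = sum(dot(w, pool[i]) > 0 for w in ws) / len(ws)
            return p * (1.0 - p)                    # estimates Vol(V+) * Vol(V-) / Vol(V)^2

        i_t = max(sorted(remaining), key=balance)   # greedy choice, ties to the lowest index
        remaining.discard(i_t)
        labeled.append((pool[i_t], oracle(i_t)))    # one label query
    n_votes = math.ceil(72 * math.log(2 / delta))   # M in Algorithm 1
    voters = sample_version_space(labeled, d, n_votes, rng)
    return [1 if sum(1 if dot(w, x) > 0 else -1 for w in voters) > 0 else -1 for x in pool]


m = 16
pool = [(math.cos((k + 0.5) * 2 * math.pi / m), math.sin((k + 0.5) * 2 * math.pi / m))
        for k in range(m)]
w_star = (math.cos(0.3), math.sin(0.3))             # hypothetical target separator
truth = [1 if dot(w_star, x) > 0 else -1 for x in pool]
predicted = aluma_sketch(pool, lambda i: truth[i], budget=8, delta=0.1)
```

On this pool of 16 points on the unit circle, a budget of 8 queries is enough for the version space to determine every label, so the majority vote reproduces the true labeling.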

5. Proof outline

In this section we describe the outline of the proof of Theorem 2. The detailed proof is
given in Appendix A. Given $S \subseteq X$ and $w$, we define the partial realization of $w$ on $S$ as

$w|_S = \{(x, \mathrm{sgn}(\langle w, x\rangle)) : x \in S\}.$

Any active learning algorithm works as follows. Let $w$ be a vector that determines $L$,
which is unknown to the learner. The algorithm starts with $S_1 = \emptyset$. At iteration $t$, the
algorithm knows $w|_{S_t}$, and based on this information, it selects a new example $x \in X$ and
sets $S_{t+1} = S_t \cup \{x\}$. We can therefore represent any algorithm by a policy function $\pi$ which
maps each partial realization to an element from $X$.⁶ We denote by $S(\pi, w, k)$ the value of
$w|_{S_{k+1}}$ if policy $\pi$ is applied for $k$ iterations, and the received labels are consistent with $w$.

We define a utility function over partial realizations as follows. Let $U$ be the uniform
distribution over $\mathbb{B}_1^d$. That is, a measurable subset $Z \subseteq \mathbb{B}_1^d$ has probability mass
$U(Z) = \mathrm{Vol}(Z)/\mathrm{Vol}(\mathbb{B}_1^d)$. Given a partial realization $w|_S$ for some $S \subseteq X$ and
$w \in \mathbb{B}_1^d$, we define the utility of the partial realization to be

$$f(w|_S) = 1 - U(V(w|_S)) = \mathbb{P}_{v \sim U}[v \notin V(w|_S)]. \tag{4}$$

That is, $f$ measures the probability mass of all vectors in $\mathbb{B}_1^d$ which are not in the version
space corresponding to the partial realization. Intuitively, a good policy should yield partial
realizations of high utility.
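
As a rough illustration of Equation (4), the utility of a partial realization can be estimated by Monte Carlo sampling from the unit ball. The function name `utility_estimate` and its parameters are assumptions made for this sketch; the paper itself does not estimate $f$ this way.

```python
import numpy as np

def utility_estimate(partial_realization, d, rng, n_samples=100_000):
    """Monte Carlo estimate of f(w|_S) = 1 - U(V(w|_S)).

    partial_realization is a list of (x, y) pairs with y in {-1, +1}.
    """
    g = rng.standard_normal((n_samples, d))
    g /= np.linalg.norm(g, axis=1, keepdims=True)
    W = g * rng.random((n_samples, 1)) ** (1.0 / d)   # uniform in the unit ball
    consistent = np.ones(n_samples, dtype=bool)
    for x, y in partial_realization:
        # A sampled vector stays in the version space only if it agrees on (x, y).
        consistent &= y * (W @ np.asarray(x, dtype=float)) > 0
    return 1.0 - consistent.mean()
```

An empty realization has utility 0, one labeled example removes half of the ball, and two orthogonal labeled examples in the plane remove three quarters of it.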

Given a budget of $k$ calls to the labeling oracle, we would like to construct a policy $\pi$
for which $f(S(\pi, w^*, k))$ is as large as possible, in the worst case over the choice of $w^*$. For
technical reasons, we prefer to derive a policy aiming to maximize the expected value of
$f(S(\pi, w^*, k))$, where the expectation is with respect to $w^* \sim U$. That is, we define

$$f_{\mathrm{avg}}(\pi, k) = \mathbb{E}_{w^* \sim U}[f(S(\pi, w^*, k))].$$

With this definition at hand, a non-efficient approach for deriving a good policy function
is to perform exhaustive search over all policies, and then choose the one for which $f_{\mathrm{avg}}(\pi, k)$
is maximal. On the other hand, we show that ALuMA, which is an efficient algorithm,
implements an approximately greedy policy for increasing $f_{\mathrm{avg}}(\pi, k)$. Indeed, let $P_t$ be the
partial realization achieved by ALuMA by the beginning of iteration $t$. At this point the
algorithm "knows" that the correct separator is in $V(P_t)$. Therefore, to make $f_{\mathrm{avg}}(\mathrm{ALuMA}, k)$
larger, we perform an approximately greedy step, by querying the label of an example which
approximately maximizes

$$\mathbb{E}\left[f(P_t \cup \{(x, \mathrm{sgn}(\langle w^*, x \rangle))\}) \mid w^* \in V(P_t)\right].$$

Standard algebraic manipulations can show that the above is equivalent to Equation ([s]). 
We see that ALuMA performs an approximately greedy policy for maximizing $f_{\mathrm{avg}}$.



For randomized algorithms, we can define a policy function for any realization of their random bits.

But how good is this greedy policy? Recently, Golovin and Krause (2010) showed that if $f$
satisfies certain conditions then an approximately greedy policy achieves an approximately
optimal expected utility. These conditions are called adaptive submodularity and adaptive
monotonicity, and it can be shown that our utility function adheres to these conditions. We
use this to show (see Section A.1) that for any policy $\pi$ and integer $k$, if we run ALuMA
for $n$ iterations then

$$f_{\mathrm{avg}}(\mathrm{ALuMA}, n) \geq f_{\mathrm{avg}}(\pi, k) - e^{-\frac{n}{4k}}. \tag{5}$$

In particular, this holds for the policy $\pi^*$ that implements the optimal algorithm in the
definition of OPT, and for $k = \mathrm{OPT}$. Since for any $w$, $S(\pi^*, w, \mathrm{OPT})$ determines the
predictions of $w$ on all the examples in $X$, it follows that
$V(S(\pi^*, w, \mathrm{OPT})) \subseteq V(S(\mathrm{ALuMA}, w, n))$.
This fact can be used to show (see Lemma 12) that for any $w$

$$f_{\mathrm{avg}}(\pi^*, \mathrm{OPT}) - f_{\mathrm{avg}}(\mathrm{ALuMA}, n) \geq U(V(w|_X)) \left( U(V(S(\mathrm{ALuMA}, w, n))) - U(V(w|_X)) \right).$$

Combining the above with Equation (5) and rearranging terms yields

$$\frac{U(V(w|_X))}{U(V(S(\mathrm{ALuMA}, w, n)))} \geq \frac{U(V(w|_X))^2}{e^{-\frac{n}{4\,\mathrm{OPT}}} + U(V(w|_X))^2}. \tag{6}$$

Our final step is to use the assumption that $w$ separates $X$ with margin $\gamma$, which
implies that $U(V(w|_X)) \geq (\gamma/2)^d$ (Lemma 13). Plugging this into Equation (6) yields that,
for $n$ large enough, at least 2/3 of the vectors in $V(S(\mathrm{ALuMA}, w, n))$ are also in $V(w|_X)$.
Hence, an (approximate) majority vote over $V(S(\mathrm{ALuMA}, w, n))$ will correctly determine the
labels of all the examples in $X$ (Corollary 15).



6. Handling inseparable data, high-dimensions, or kernels 

If the data $X = \{x_1, \ldots, x_m\}$ is in a very high dimension, or it is not guaranteed to be
separable, or it is represented only using a kernel matrix, then ALuMA can still be used
after a preprocessing step. This preprocessing step maps the points in $X$ to a set of points
in a lower dimension, which are separable using the original labels of $X$. We describe the
procedure below, and prove that it satisfies the requirements of Theorem 3 in Appendix B.

The preprocessing step is composed of two simple transformations. In the first
transformation, which can be skipped if the data is known to be separable, each example $x_i \in X$
is mapped to an example in dimension $d + m$, defined by $x'_i = (a x_i; \sqrt{1 - a^2} \cdot e_i)$, where
$e_i$ is the $i$'th vector of the natural basis of $\mathbb{R}^m$ and $a > 0$ is a scalar that will be defined
below. Thus the first $d$ coordinates of $x'_i$ hold the original vector times $a$, and the rest of the
coordinates are zero, except for $x'_i[d + i] = \sqrt{1 - a^2}$. This mapping guarantees that the set
$X' = (x'_1, \ldots, x'_m)$ is separable with the same labels as those of $X$, and with a margin that
depends on the cumulative squared-hinge-loss of the data.
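
The first transformation can be sketched in a few lines of NumPy. The function name `lift_to_separable` is hypothetical, and the scalar `a` is passed in by the caller rather than derived from the margin parameters, so this is only an illustration of the mapping itself.

```python
import numpy as np

def lift_to_separable(X, a):
    """Map each row x_i of X to x'_i = (a * x_i ; sqrt(1 - a^2) * e_i)."""
    X = np.asarray(X, dtype=float)
    m, _ = X.shape
    if not 0.0 < a <= 1.0:
        raise ValueError("a must lie in (0, 1]")
    # The appended m coordinates give every example its own private axis.
    return np.hstack([a * X, np.sqrt(1.0 - a * a) * np.eye(m)])
```

Because every example receives a private coordinate, even two identical points with opposite labels become separable after the mapping, at the cost of a smaller margin.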



In the second transformation, a Johnson-Lindenstrauss random projection (Johnson and
Lindenstrauss, 1984; Bourgain, 1985) is applied to $X'$, thus producing a new set of points
$\bar{X} = (\bar{x}_1, \ldots, \bar{x}_m)$ in a lower dimension $\mathbb{R}^k$, where $k$ depends on the original margin and on
the amount of margin error. With high probability, the new set of points will be separable
with a margin that also depends on the original margin and on the amount of margin error.

If the input data is provided not as vectors in $\mathbb{R}^d$ but via a kernel matrix, then a simple
decomposition is performed before the preprocessing begins. The full preprocessing
procedure is listed below as Alg. 2. The first input to the algorithm is the data for preprocessing,
given as $X \subseteq \mathbb{R}^d$ or as a kernel matrix $K \in \mathbb{R}^{m \times m}$. The other inputs are $\gamma$, a margin
parameter; $H$, an upper bound on the margin error relative to $\gamma$; and $\delta$, which is the required
confidence.



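Algorithm 2 Preprocessing
1: Input: $X = \{x_1, \ldots, x_m\}$ or $K \in \mathbb{R}^{m \times m}$, $\gamma$, $H$, $\delta$
2: if input data is a kernel matrix $K$ then
3:   Find $U \in \mathbb{R}^{m \times m}$ such that $K = U U^T$
4:   $\forall i \in [m]$, $x_i \leftarrow$ row $i$ of $U$
5:   $d \leftarrow m$
6: end if
8: $\forall i \in [m]$, $x'_i \leftarrow (a x_i; \sqrt{1 - a^2} \cdot e_i)$
10: $M \leftarrow$ a random $\{\pm 1\}$ matrix of dimension $k \times (d + m)$
11: for $i \in [m]$ do
12:   $\bar{x}_i \leftarrow M x'_i$
13: end for
14: Return $(\bar{x}_1, \ldots, \bar{x}_m)$.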
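
The kernel decomposition (step 3) and the random sign projection (steps 10 to 12) can be sketched as follows. The function names are hypothetical. The eigendecomposition is only one possible way to find $U$ with $K = UU^T$, and the $1/\sqrt{k}$ scaling is an assumption added here so that norms are roughly preserved; it does not affect separability by homogeneous halfspaces.

```python
import numpy as np

def factor_kernel(K):
    """Return U with K = U U^T for a symmetric PSD matrix K."""
    eigvals, eigvecs = np.linalg.eigh(K)
    eigvals = np.clip(eigvals, 0.0, None)    # remove tiny negative round-off
    return eigvecs * np.sqrt(eigvals)        # scales column j by sqrt(eigenvalue j)

def random_sign_projection(X_lifted, k, rng):
    """Project the rows of X_lifted to dimension k with a random {+1, -1} matrix."""
    M = rng.choice([-1.0, 1.0], size=(k, X_lifted.shape[1]))
    return X_lifted @ M.T / np.sqrt(k)       # scaling by 1/sqrt(k) is an assumption
```

The target dimension `k` is left to the caller in this sketch.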



After the preprocessing step, $\bar{X}$ is used as input to ALuMA, which then returns a set of
labels for the examples in $\bar{X}$. These are also the labels of the examples in the original $X$. To
retrieve a halfspace for $X$ with the least margin error, any passive learning algorithm can
be applied to the resulting labeled sample. The full active learning procedure is described
in Alg. 3.

Note that if ALuMA returns the correct labels for the sample, the usual generalization 
bounds for passive supervised learning can be used to bound the true error of the returned 
separator w. In particular, we can apply the support vector machine algorithm (SVM) and 
rely on generalization bounds for SVM. 



Algorithm 3 Active Learning
1: Input: $X = \{x_1, \ldots, x_m\}$ or $K \in \mathbb{R}^{m \times m}$, $L : [m] \to \{-1, 1\}$, $N$, $\gamma$, $H$, $\delta$
2: if input has $X$ then
3:   Get $\bar{X}$ by running Alg. 2 with input $X, \gamma, H, \delta/2$.
4: else
5:   Get $\bar{X}$ by running Alg. 2 with input $K, \gamma, H, \delta/2$.
6: end if
7: Get $(y_1, \ldots, y_m)$ by running ALuMA with input $\bar{X}, L, N, \delta/2$.
8: Get $w \in \mathbb{R}^d$ by running SVM on the labeled sample $\{(x_1, y_1), \ldots, (x_m, y_m)\}$.
9: Return $w$.
7. A Simpler Implementation of ALuMA

The ALuMA algorithm described in Alg. 1 uses $O(Nm)$ volume estimations as a black-box
procedure, where N is the budget of labels and m is the pool size. The complexity of each 
application of the volume estimation procedure is 0{S') where d is the dimension. Thus 
the overall complexity of the algorithm is 0{Nmd^). This complexity can be somewhat 
improved under some "luckiness" conditions. 

The volume estimation procedure uses $\lambda$-uniform sampling based on hit-and-run as its
core procedure. Instead, we can use hit-and-run directly as follows: At each iteration of
ALuMA, instead of step 5 perform the following procedure:

Algorithm 4 Estimation Procedure 
1: Input: A G {0,^),VtJt 

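2: $k \leftarrow \frac{\ln(2Nm/\delta)}{2\lambda^2}$
3: Sample $h_1, \ldots, h_k \in V_t$ $\lambda$-uniformly.
4: $\forall i \in I_t,\ j \in \{-1, +1\}$, $\hat{v}_{x_i,j} \leftarrow \frac{1}{k} \left| \{ l \in [k] : h_l(x_i) = j \} \right|$.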
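
A minimal sketch of this estimation procedure, assuming the sampled hypotheses are already available as rows of a matrix. The names `num_hypotheses` and `estimate_fractions` are hypothetical, and rounding $k$ up to an integer is a choice made for this illustration.

```python
import math
import numpy as np

def num_hypotheses(n_queries, pool_size, delta, lam):
    """k = ln(2 N m / delta) / (2 lambda^2), rounded up to an integer."""
    return math.ceil(math.log(2 * n_queries * pool_size / delta) / (2 * lam ** 2))

def estimate_fractions(H, X_pool):
    """Fraction of sampled hypotheses labeling each pool point +1 and -1.

    H holds one sampled halfspace per row; X_pool holds one example per row.
    """
    preds = np.sign(H @ X_pool.T)             # shape (k, number of pool points)
    v_plus = (preds == 1).mean(axis=0)
    v_minus = (preds == -1).mean(axis=0)
    return v_plus, v_minus
```

The example to query is then the one maximizing `v_plus * v_minus`, mirroring step 6 of Alg. 1.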



The complexity of ALuMA when using this procedure is 0{N{d^/X'^ + m/A^)), which 
is better than the complexity of the full Alg. 1 for a constant $\lambda$. An additional practical
benefit of this alternative estimation procedure is that, when implementing it, it is easy to
limit the actual computation time by running the procedure with a smaller number $k$ and
a smaller number of hit-and-run mixing iterations. This
provides a natural trade-off between computation time and labeling costs.
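
For readers unfamiliar with hit-and-run, the following sketch shows one way to run the walk inside a version space of the form $\{w : \|w\| \leq 1,\ y_i \langle w, x_i \rangle > 0\}$. It is an illustration under stated assumptions, not the implementation used in the experiments: the function name `hit_and_run` is hypothetical, the caller must supply an interior starting point `w0`, and the number of steps is a free parameter with no mixing guarantee attached.

```python
import numpy as np

def hit_and_run(w0, A, n_steps, rng):
    """Random walk in {w : ||w|| <= 1, A w > 0}, started at an interior point w0.

    Each row of A is y_i * x_i for a labeled example (x_i, y_i) seen so far.
    """
    w = np.array(w0, dtype=float)
    for _ in range(n_steps):
        u = rng.standard_normal(w.shape[0])
        u /= np.linalg.norm(u)                       # uniformly random direction
        # Chord of the unit ball along the line w + t * u.
        b = w @ u
        disc = np.sqrt(max(b * b - (w @ w - 1.0), 0.0))
        lo, hi = -b - disc, -b + disc
        # Each halfspace constraint a . (w + t u) > 0 trims one end of the chord.
        aw, au = A @ w, A @ u
        pos, neg = au > 1e-12, au < -1e-12
        if pos.any():
            lo = max(lo, np.max(-aw[pos] / au[pos]))
        if neg.any():
            hi = min(hi, np.min(-aw[neg] / au[neg]))
        w = w + rng.uniform(lo, hi) * u              # uniform point on the chord
    return w
```

Reducing `n_steps` is the knob that trades sampling quality for computation time, as discussed above.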

Theorem 20 in Appendix C shows that letting ALuMA use Alg. 4 as the estimation
procedure results in similar guarantees to those of the original implementation of ALuMA
(Alg. 1). The only condition is that the best example in each iteration induces a fairly
balanced partition of the current version space. In our experiments we noticed that this 
is generally the case in practice. Moreover, the theorem shows that it is possible to verify 
that the condition holds while running the algorithm. Thus, the estimation procedure can 
easily be augmented with an additional verification step at the beginning of each iteration. 
On iterations that fail the verification, the algorithm will use the original black-box volume 
estimation procedure. 



8. Experiments 



We evaluated ALuMA over synthetic and real data sets and compared its label complexity 
performance to that of a passive ERM (that is, one that uses random labeled examples), as 
well as to that of QBC and CAL. 

Our implementation of ALuMA uses hit-and-run samples instead of full-blown volume
estimation, as detailed in Section 7 above. QBC is also implemented using hit-and-run as in
Gilad-Bachrach et al. (2005). For both ALuMA and QBC, we used a fixed number of mixing
iterations for hit-and-run, which we set to 1000. We also fixed the number of sampled
hypotheses at each iteration of ALuMA to 1000, and used the same set of hypotheses
to calculate the majority vote for classification. CAL and QBC examine the examples
sequentially, thus the input provided to them was a random ordering of the example pool.
Since the active learners operate by reducing the training error, the graphs we show compare
the training errors of the different algorithms. The test errors show a similar trend.

Gilad-Bachrach et al. (2005) report that the actual mixing time of hit-and-run is much faster than the one
guaranteed by the theoretical bounds, and we have observed a similar phenomenon in our experiments.

 d     ALuMA   QBC    CAL     ERM
 10    29      50     308     1008
 12    38      113    862     3958
 15    55      150    2401    > 20000

Table 1: Octahedron: Number of iterations to achieve zero error

In each of the algorithms, the classification of the training examples is done using the 
version space defined by the queried labels. The theory for CAL and ERM allows selecting 
an arbitrary predictor out of the version space. In QBC, the hypothesis should be drawn 
uniformly at random from the version space. However, we have found that all the algorithms 
show a significant improvement in classification error if they classify using the majority vote 
classification proposed for ALuMA. Therefore, in all of our experiments, the results for all 
the algorithms are based on a majority vote classification. 

The first experiment is synthetic: the pool of examples is taken to be the support
of the distribution described in Example 2 (the octahedron example), with an additional
dimension to account for halfspaces with a bias. We also added the negative vertices $-e_i$
to the pool. Similarly to the proof of Theorem 1, it suffices to query the vertices of the
octahedron to reach zero error. Table 1 lists the number of iterations required in practice to
achieve zero error by each of the algorithms. ALuMA is clearly much better than QBC and
CAL. Furthermore, the number of queries ALuMA requires is indeed close to the number
of vertices.

The second batch of experiments is with the MNIST dataset. The examples in this
dataset are gray-scale images of handwritten digits in dimension 784. Each digit has about
6,000 training examples. We performed binary active learning by pre-selecting pairs of
digits.

Figure 1 and Figure 2 depict the training error as a function of the label budget when
learning to distinguish between the digits 3 and 5, and between the digits 4 and 7. Both of
these digit pairs are linearly separable in this dataset. Figure 1 depicts the error as a function
of the label budget. It is striking to observe that CAL provides no improvement over passive
ERM in the first 1000 examples, while this budget suffices to reach zero training error for
ALuMA.

We also tested the effectiveness of our approach for data with labeling errors (see
Section 6). To this end we applied the preprocessing algorithm listed in Alg. 2 to the linear
representation of the digits 2 and 3, which are not separable in the original representation.
Applying the Johnson-Lindenstrauss projection to the enhanced representation resulted in
a separable representation in dimension 800. Figure 3 depicts the performance of ALuMA,
CAL, QBC and ERM, all on the transformed separable representation. Finally, to test
the use of kernel representations, we generated a dataset in which two digits (4 and 7) are
labeled as positive, and two other digits (3 and 5) are labeled as negative, and used an RBF
kernel representation so that the data is separable. The preprocessing step resulted in a
separable representation in dimension 700. The results of running each of the algorithms
on this representation are depicted in Figure 4. Note that QBC does not use the entire
budget of labels in this experiment. This is because the mechanism by which QBC selects
an example takes a very long time when the training error is small, thus we were not able
to run it long enough to use its full label budget.

8. http://yann.lecun.com/exdb/mnist/

Figure 1: MNIST 3 vs. 5

Figure 2: MNIST 4 vs. 7

Figure 3: MNIST 2 vs. 3 (non-separable).

Figure 4: MNIST 4,7 vs. 3,5 with RBF kernel.

We also tested the algorithms on the PCMAC dataset. This is a real-world data set,
which represents a two-class categorization of the 20-Newsgroup collection. The examples
are web-posts represented using bag-of-words. The original dimension of the examples is 7511.
We used the Johnson-Lindenstrauss projection to reduce the dimension to 300, which kept
the data separable. We used a training set of 1000 examples. Figure 5 depicts the
results. Here too we were not able to run QBC long enough to use its entire budget.

The experiments on MNIST and PCMAC show that ALuMA is superior to CAL and
QBC for real-world distributions, in which CAL and QBC have no theoretical analysis. The
next experiment shows that ALuMA outperforms CAL and QBC even on data sampled
from the uniform distribution on a sphere, which is the distribution with the best guarantees
for both CAL and QBC. Figure 6 and Figure 7 depict the training error as a function of
the label budget when learning a random halfspace over the uniform distribution in $\mathbb{R}^{10}$
and in $\mathbb{R}^{100}$. Here too, CAL requires many more labels than ALuMA requires. For $\mathbb{R}^{100}$ it
improves over passive learning only long after ALuMA has reached zero training error. The
difference between the performance of the different algorithms is less marked for $d = 10$
than for $d = 100$, suggesting that the difference grows with the dimension.

9. http://vikas.sindhwani.org/datasets/lskm/matlab/pcmac.mat

Figure 6: Uniform distribution (d = 10).

Figure 7: Uniform distribution (d = 100).



9. Discussion 

We studied active learning of halfspaces under a margin assumption. We have shown that a 
large margin assumption alone cannot guarantee a uniform improvement in label complexity 
over passive learning. However, the margin assumption enables us to derive an algorithm 
with regret-type guarantees: it will not query many more labels than OPT, where OPT is 
the minimal number of queries required to achieve zero training error on the given pool of 
examples, in the worst-case over the target hypotheses. 

An open problem, which we leave for future work, is whether one can be efficiently
competitive with respect to a relaxed definition of OPT, one which only requires achieving
zero training error with high probability, and only if the target hypothesis separates the
training set with a margin $\gamma$.
Appendix A. Analysis of ALuMA

In this section we prove Theorem 2 by showing that ALuMA satisfies the conditions of
the theorem. First, note that each step in Alg. 1 runs in polynomial time in $N$, $m$, $d$ and
$\ln(1/\delta)$. Since each step is repeated at most $N \leq m$ times, ALuMA is polynomial in $m$, $d$
and $\ln(1/\delta)$. This proves item (3) of Theorem 2. In the following we prove items (1) and
(2), following the proof outline described in Section 5. The result will be stated formally as
Corollary 15.



A.1 ALuMA Increases the Utility Function Fast

Recall that in Equation (4) we defined a utility function $f$ that measures the progress of
our algorithm. In this section we redefine $f$ using a slightly different notation. Let $\mathcal{H}$ be
the hypothesis class induced by homogeneous halfspaces in $\mathbb{R}^d$. We define a version space
of a partial realization as a subset of $\mathcal{H}$ as follows:

$$V(P_t) = \{h \in \mathcal{H} : \forall (x, y) \in P_t,\ h(x) = y\}.$$

We further define the probability mass of a set $G \subseteq \mathcal{H}$ by

$$\mathbb{P}(G) = U\left(\{w \in \mathbb{B}_1^d \mid \exists h \in G,\ \forall x \in \mathbb{R}^d,\ h(x) = \mathrm{sgn}(\langle w, x \rangle)\}\right),$$

where $U$ is the uniform distribution over $\mathbb{B}_1^d$. Consequently, the expected value of a random
variable $g : \mathcal{H} \to \mathbb{R}$ is defined as

$$\mathbb{E}_h[g(h)] = \int_{h \in \mathcal{H}} g(h)\, \mathbb{P}(dh).$$

The utility function $f$ for $h \in \mathcal{H}$ and $S \subseteq X$ is thus

$$f(h|_S) = 1 - \mathbb{P}(V(h|_S)) = \mathbb{P}(\mathcal{H} \setminus V(h|_S)). \tag{7}$$

Let $\mathcal{L}_{X,\mathcal{H}} = \{h|_{X'} : X' \subseteq X,\ h \in \mathcal{H}\}$ be the set of all possible partial labelings of $X$ by a
hypothesis in $\mathcal{H}$. A policy is any function $\pi : \mathcal{L}_{X,\mathcal{H}} \to X$. For a policy $\pi$, an integer $k$, and
a hypothesis $h \in \mathcal{H}$, we denote by $S(\pi, h, k)$ the first $k$ (example, label) pairs queried by $\pi$,
under the assumption that $L = h|_X$. That is,

$$S(\pi, h, 1) = \{(\pi(\emptyset), h(\pi(\emptyset)))\}, \text{ and}$$

$$S(\pi, h, k) = S(\pi, h, k-1) \cup \{(\pi(S(\pi, h, k-1)),\ h(\pi(S(\pi, h, k-1))))\}.$$

The expected utility of applying policy $\pi$ for $k$ steps is

$$f_{\mathrm{avg}}(\pi, k) = \mathbb{E}_h[f(S(\pi, h, k))].$$

In this section we show that ALuMA increases $f_{\mathrm{avg}}$ almost as fast as any other policy,
including the optimal one. We prove that with probability at least $1 - \delta/2$, for any policy
$\pi$ and any $n, k > 0$,

$$f_{\mathrm{avg}}(\mathrm{ALuMA}, n) \geq f_{\mathrm{avg}}(\pi, k) - e^{-\frac{n}{4k}}. \tag{8}$$

To prove this inequality we present the notion of an adaptive submodular function, first
defined in Golovin and Krause (2010). Let $f : \mathcal{L}_{X,\mathcal{H}} \to \mathbb{R}_+$ be any utility function from the
set of possible partial labelings of $X$ to the non-negative reals. We define the notions of
adaptive monotonicity and adaptive submodularity of a utility function using the following
notation: For an element $x \in X$, a subset $Z \subseteq X$ and a hypothesis $h \in \mathcal{H}$, we define
the conditional expected marginal benefit of $x$, conditioned on having observed the partial
labeling $h|_Z$, by

$$\Delta(h|_Z, x) = \mathbb{E}_g\left[f(g|_{Z \cup \{x\}}) - f(g|_Z) \mid g|_Z = h|_Z\right].$$

Put another way, $\Delta(h|_Z, x)$ is the expected improvement of $f$ if we add to $Z$ the element
$x$, where the expectation is over a choice of a hypothesis $g$ taken uniformly at random from the
set of hypotheses that agree with $h$ on $Z$.
Definition 6 (Adaptive Monotonicity) A utility function $f : \mathcal{L}_{X,\mathcal{H}} \to \mathbb{R}_+$ is adaptive
monotone if the conditional expected marginal benefit is always non-negative. That is, if for
all $h \in \mathcal{H}$, $Z \subseteq X$ and $x \in X$, $\Delta(h|_Z, x) \geq 0$.

Definition 7 (Adaptive Submodularity) A function $f : \mathcal{L}_{X,\mathcal{H}} \to \mathbb{R}_+$ is adaptive
submodular if the conditional expected marginal benefit of a given item does not increase if the
partial labeling is extended. That is, if for all $h \in \mathcal{H}$, for all $Z_1 \subseteq Z_2 \subseteq X$, and for all
$x \in X$,

$$\Delta(h|_{Z_1}, x) \geq \Delta(h|_{Z_2}, x).$$

The central theorem of adaptive submodularity, stated below as Theorem 9, links the
expected utility of the optimal policy for maximizing $f_{\mathrm{avg}}$ with the expected utility of an
approximately greedy policy.

Definition 8 (Approximate Greedy) Let $\alpha \geq 1$. A policy $\pi : \mathcal{L}_{X,\mathcal{H}} \to X$ is $\alpha$-approximately
greedy with respect to a utility function $f$ if for every $h \in \mathcal{H}$ and for every $Z \subseteq X$

$$\Delta(h|_Z, \pi(h|_Z)) \geq \frac{1}{\alpha} \max_{x \in X} \Delta(h|_Z, x). \tag{9}$$

Theorem 9 (Golovin and Krause (2010)) Let $f : \mathcal{L}_{X,\mathcal{H}} \to \mathbb{R}_+$ be a utility function,
and let $\pi : \mathcal{L}_{X,\mathcal{H}} \to X$ be a policy. If $f$ is adaptive monotone and adaptive submodular, and
$\pi$ is $\alpha$-approximately greedy, then for any policy $\pi^*$ and for all positive integers $n, k$,

$$f_{\mathrm{avg}}(\pi, n) \geq \left(1 - e^{-\frac{n}{\alpha k}}\right) f_{\mathrm{avg}}(\pi^*, k). \tag{10}$$

In the active learning setting, we define the utility function $f$ as in Equation (7) and have
the following result:

Lemma 10 (Golovin and Krause (2010)) The function $f$ defined in Equation (7) is
adaptive monotone and adaptive submodular.

Therefore Theorem 9 holds for this utility function. It follows that for any $\alpha$-approximate
greedy policy $\pi$, any policy $\pi^*$ and any integers $n, k > 0$,

$$f_{\mathrm{avg}}(\pi, n) \geq \left(1 - e^{-\frac{n}{\alpha k}}\right) f_{\mathrm{avg}}(\pi^*, k) \geq f_{\mathrm{avg}}(\pi^*, k) - e^{-\frac{n}{\alpha k}}. \tag{11}$$

In order to prove that Equation (8) holds, we now show that the policy implemented by
ALuMA is approximately greedy relative to $f$, with high probability over the randomization
of ALuMA.

Theorem 11 With probability at least $1 - \delta/2$, the policy applied by ALuMA is a
4-approximately greedy policy with respect to the utility function $f$ defined in Equation (7).

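Proof Define $\lambda = 1/3$ and $\alpha = \left(\frac{1+\lambda}{1-\lambda}\right)^2 = 4$. We need to show that for any $Z \subseteq X$, our
policy selects an element $x$ that approximately maximizes $\Delta(h|_Z, x)$. Let $V_t = V(h|_Z)$ be the
version space at the current iteration $t$, and let $V_{t,x}^{j}$ be the subset of $V_t$ that labels $x$ by $j$. Then

$$\Delta(h|_Z, x) = \mathbb{E}_g\left[f(g|_{Z \cup \{x\}}) - f(g|_Z) \mid g|_Z = h|_Z\right]$$

$$= \frac{\mathbb{P}(V_{t,x}^{1})}{\mathbb{P}(V_t)} \left(\mathbb{P}(V(h|_Z)) - \mathbb{P}(V(h|_Z \cup \{(x, 1)\}))\right) + \frac{\mathbb{P}(V_{t,x}^{-1})}{\mathbb{P}(V_t)} \left(\mathbb{P}(V(h|_Z)) - \mathbb{P}(V(h|_Z \cup \{(x, -1)\}))\right)$$

$$= \frac{2\, \mathbb{P}(V_{t,x}^{1})\, \mathbb{P}(V_{t,x}^{-1})}{\mathbb{P}(V_t)}.$$

Therefore it suffices to show that for all $x \in X$ and for all iterations $t$,

$$\mathbb{P}(V_{t,x_t}^{1})\, \mathbb{P}(V_{t,x_t}^{-1}) \geq \frac{1}{4}\, \mathbb{P}(V_{t,x}^{1})\, \mathbb{P}(V_{t,x}^{-1}),$$

where $x_t$ is the element chosen by ALuMA at iteration $t$.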

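By line 6 of Alg. 1 we select $x_t \in X_t$ that maximizes $\hat{v}_{x,1} \cdot \hat{v}_{x,-1}$, where

$$\hat{v}_{x,j} \leftarrow \mathrm{VolEst}(V_{t,x}^{j}, \lambda, \tfrac{\delta}{4mN}), \quad j \in \{\pm 1\}.$$

Since ALuMA calls VolEst at most $2mN$ times in total, by Lemma 4 with probability
$1 - \delta/2$, for all $x \in X$, $j \in \{\pm 1\}$, $t \in [N]$,

$$\frac{\hat{v}_{x,j}}{1+\lambda} \leq \mathbb{P}(V_{t,x}^{j}) \leq \frac{\hat{v}_{x,j}}{1-\lambda}.$$

Therefore, for all $x \in X$, $t \in [N]$,

$$\frac{\hat{v}_{x,1}\, \hat{v}_{x,-1}}{(1+\lambda)^2} \leq \mathbb{P}(V_{t,x}^{1})\, \mathbb{P}(V_{t,x}^{-1}) \leq \frac{\hat{v}_{x,1}\, \hat{v}_{x,-1}}{(1-\lambda)^2}.$$

Let $x_t^* = \arg\max_{x \in X} \Delta(h|_Z, x) = \arg\max_{x \in X} \mathbb{P}(V_{t,x}^{1})\, \mathbb{P}(V_{t,x}^{-1})$. Then, by the choice of $x_t$,

$$\hat{v}_{x_t,1}\, \hat{v}_{x_t,-1} \geq \hat{v}_{x_t^*,1}\, \hat{v}_{x_t^*,-1}.$$

Therefore

$$\mathbb{P}(V_{t,x_t}^{1})\, \mathbb{P}(V_{t,x_t}^{-1}) \geq \frac{\hat{v}_{x_t,1}\, \hat{v}_{x_t,-1}}{(1+\lambda)^2} \geq \frac{\hat{v}_{x_t^*,1}\, \hat{v}_{x_t^*,-1}}{(1+\lambda)^2} \geq \frac{(1-\lambda)^2}{(1+\lambda)^2}\, \mathbb{P}(V_{t,x_t^*}^{1})\, \mathbb{P}(V_{t,x_t^*}^{-1}).$$

Since $4 = \alpha = \frac{(1+\lambda)^2}{(1-\lambda)^2}$, this proves that our policy is 4-approximately greedy.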



A.2 Comparing to OPT

By Theorem 11, ALuMA implements an approximately greedy policy. In addition, by
Equation (11), any approximately greedy policy reduces the version space almost as fast as
any other policy, including the optimal one. While these results seem promising, they are
not enough for our needs. Recall that our true objective is not to have a small version space
but rather to be able to correctly label all the examples in $X$. Therefore, we must quantify
how the size of the version space corresponds to this objective.

The first issue with Equation (11) is that it does not provide a worst-case guarantee,
since $f_{\mathrm{avg}}$ averages over all $h \in \mathcal{H}$. It should be noted that Golovin and Krause (2010)
derive a worst-case guarantee as well, but their approach requires that the utility function
receive discrete values, which clearly does not hold in our case. The following lemma helps
solve this issue, by providing a guarantee which holds individually for any $h \in \mathcal{H}$.
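Lemma 12 Let $\pi^*$ be a policy that achieves OPT, that is $c_{\mathrm{wc}}(\pi^*) = \mathrm{OPT}$. For any $h \in \mathcal{H}$,
any $\alpha$-approximate greedy policy $\pi$, and any $n$,

$$f_{\mathrm{avg}}(\pi^*, \mathrm{OPT}) - f_{\mathrm{avg}}(\pi, n) \geq \mathbb{P}(V(h|_X)) \left(\mathbb{P}(V(S(\pi, h, n))) - \mathbb{P}(V(h|_X))\right).$$

Proof Since $\pi^*$ is an optimal policy, the version space induced by the labels $\pi^*$ queried
within OPT iterations is exactly the set of hypotheses which are consistent with the true
labels of the sample. Therefore, for any $h \in \mathcal{H}$,

$$\mathbb{P}(V(S(\pi^*, h, \mathrm{OPT}))) = \mathbb{P}(V(h|_X)).$$

By definition of $f_{\mathrm{avg}}$,

$$f_{\mathrm{avg}}(\pi^*, \mathrm{OPT}) - f_{\mathrm{avg}}(\pi, n) = \mathbb{E}_h\left[\mathbb{P}(V(S(\pi, h, n))) - \mathbb{P}(V(S(\pi^*, h, \mathrm{OPT})))\right]$$

$$= \mathbb{E}_h\left[\mathbb{P}(V(S(\pi, h, n))) - \mathbb{P}(V(h|_X))\right].$$

Since $S(\pi, h, n)$ does not depend on the value of $h$ outside of $X$, we can sum over the possible
labelings of $X$ to have

$$f_{\mathrm{avg}}(\pi^*, \mathrm{OPT}) - f_{\mathrm{avg}}(\pi, n) = \sum_{h|_X : h \in \mathcal{H}} \mathbb{P}(V(h|_X)) \left(\mathbb{P}(V(S(\pi, h|_X, n))) - \mathbb{P}(V(h|_X))\right).$$

Now, it is easy to see that for any $h \in \mathcal{H}$, $V(S(\pi, h|_X, n)) \supseteq V(h|_X)$, thus

$$\mathbb{P}(V(S(\pi, h|_X, n))) - \mathbb{P}(V(h|_X)) \geq 0.$$

It follows that for any $h \in \mathcal{H}$

$$f_{\mathrm{avg}}(\pi^*, \mathrm{OPT}) - f_{\mathrm{avg}}(\pi, n) \geq \mathbb{P}(V(h|_X)) \left(\mathbb{P}(V(S(\pi, h|_X, n))) - \mathbb{P}(V(h|_X))\right).$$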



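Combining Equation (11) and Lemma 12 we conclude that for any $\alpha$-approximate greedy
policy $\pi$,

$$\forall h \in \mathcal{H}, \quad \mathbb{P}(V(h|_X)) \left(\mathbb{P}(V(S(\pi, h|_X, n))) - \mathbb{P}(V(h|_X))\right) \leq e^{-\frac{n}{\alpha\,\mathrm{OPT}}},$$

which yields

$$\forall h \in \mathcal{H}, \quad \frac{\mathbb{P}(V(h|_X))}{\mathbb{P}(V(S(\pi, h, n)))} \geq \frac{\mathbb{P}(V(h|_X))^2}{e^{-\frac{n}{\alpha\,\mathrm{OPT}}} + \mathbb{P}(V(h|_X))^2}. \tag{13}$$

This means that if $\mathbb{P}(V(h|_X))$ is large enough and we run an approximate greedy policy,
then after a sufficient number of iterations, most of the remaining version space induces the
correct labeling of the sample. We now show that under the margin assumption, $\mathbb{P}(V(h|_X))$
is indeed bounded from below.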
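A.3 Using the Margin Assumption

Denote by $\mathcal{H}_\gamma$ the subset of $\mathcal{H}$ which has margin $\gamma$ on the unlabeled sample $X$, that is

$$\mathcal{H}_\gamma = \{h \in \mathcal{H} \mid \exists w \in \mathbb{B}_1^d \text{ s.t. } \forall x \in X,\ h(x)\langle w, x \rangle \geq \gamma\}.$$

Our bound on the number of queries for a hypothesis with a margin hinges on the fact
that $\mathbb{P}(V(h|_X))$ is large enough if $h$ has a margin on $X$. This is quantified in the following
lemma.

Lemma 13 For all $h \in \mathcal{H}_\gamma$, $\mathbb{P}(V(h|_X)) \geq \left(\frac{\gamma}{2}\right)^d$.

Proof Fix some $h \in \mathcal{H}_\gamma$. Let $w \in \mathbb{B}_1^d$ such that $\forall x \in X$, $h(x)\langle w, x \rangle \geq \gamma$. For a given
$v \in \mathbb{B}_1^d$, denote by $h_v$ the mapping $x \mapsto \mathrm{sgn}(\langle v, x \rangle)$. Note that for all $v \in \mathbb{B}_1^d$ such that
$\|w - v\| < \gamma$, $h_v \in V(h|_X)$. This is because for all $x \in X$,

$$h(x)\langle v, x \rangle = \langle v - w, h(x) \cdot x \rangle + h(x)\langle w, x \rangle \geq -\|w - v\| \cdot \|h(x) \cdot x\| + \gamma > -\gamma + \gamma = 0,$$

which implies $\mathrm{sgn}(\langle v, x \rangle) = h(x)$. It follows that $\{v \mid h_v \in V(h|_X)\} \supseteq \mathbb{B}_1^d \cap B(w, \gamma)$, where
$B(z, r)$ denotes the ball of radius $r$ with center at $z$. Let $u = (1 - \gamma/2) w$. Then for any
$z \in B(u, \gamma/2)$, we have $z \in \mathbb{B}_1^d$, since

$$\|z\| = \|z - u + u\| \leq \|z - u\| + \|u\| \leq \gamma/2 + 1 - \gamma/2 = 1.$$

In addition, $z \in B(w, \gamma)$ since

$$\|z - w\| = \|z - u + u - w\| \leq \|z - u\| + \|u - w\| \leq \gamma/2 + \gamma/2 = \gamma.$$

Therefore $B(u, \gamma/2) \subseteq \mathbb{B}_1^d \cap B(w, \gamma)$. We conclude that $\{v \mid h_v \in V(h|_X)\} \supseteq B(u, \gamma/2)$.
Thus,

$$\mathbb{P}(V(h|_X)) \geq \mathrm{Vol}(B(u, \gamma/2)) / \mathrm{Vol}(\mathbb{B}_1^d) \geq \left(\frac{\gamma}{2}\right)^d.$$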



We can now ensure that most of the remaining version space induces the correct labeling.
The following theorem is a direct consequence of combining Lemma 13 with Equation (13).

Theorem 14 For any $h \in \mathcal{H}_\gamma$, any $\alpha$-approximate greedy policy $\pi$, and any

$$n \geq \alpha \cdot \mathrm{OPT} \cdot (2d \ln(2/\gamma) + \ln(2)),$$

we have

$$\frac{\mathbb{P}(V(h|_X))}{\mathbb{P}(V(S(\pi, h, n)))} \geq \frac{2}{3}. \tag{14}$$
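
The arithmetic behind Theorem 14 can be checked numerically. The sketch below plugs the lower bound $(\gamma/2)^d$ from Lemma 13 into the right-hand side of Equation (13); the function names and the example values of OPT, $d$ and $\gamma$ are hypothetical and chosen only for illustration.

```python
import math

def sufficient_queries(opt, d, gamma, alpha=4.0):
    """Smallest integer n with n >= alpha * OPT * (2 d ln(2/gamma) + ln 2)."""
    return math.ceil(alpha * opt * (2 * d * math.log(2 / gamma) + math.log(2)))

def ratio_lower_bound(n, opt, d, gamma, alpha=4.0):
    """Right-hand side of Equation (13) with P(V(h|X)) replaced by (gamma/2)^d."""
    p_sq = (gamma / 2) ** (2 * d)
    return p_sq / (math.exp(-n / (alpha * opt)) + p_sq)
```

For OPT = 3, $d = 5$ and $\gamma = 0.1$ the threshold is 368 queries, at which point the bound already exceeds 2/3, while half that budget leaves the bound close to zero.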
We are now ready to prove items (1) and (2) of Theorem 2.
Corollary 15 If ALuMA is executed for $n \geq 4 \cdot \mathrm{OPT} \cdot (2d \ln(2/\gamma) + \ln(2))$ iterations, then
with probability at least $1 - \delta$ ALuMA returns the correct labeling of all the elements of $X$.

Proof By Theorem 11, with probability at least $1 - \delta/2$ ALuMA runs a 4-approximately
greedy policy. Therefore, if $n$ satisfies our assumption, then Equation (14) holds. Since
$V(h|_X) \subseteq V(S(\pi, h, n)) = V_n$, it follows that the probability of drawing a hypothesis from
$V(h|_X)$ when drawing uniformly from $V_n$ is at least $\frac{2}{3}$. In step 12 of ALuMA, $M \geq 72 \ln(2/\delta)$
hypotheses are drawn $\frac{1}{12}$-uniformly at random from $V_n$. Therefore each hypothesis $h_i$ is from
$V(h|_X)$ with probability at least $\frac{2}{3} - \frac{1}{12} = \frac{7}{12}$. By Hoeffding's inequality,

$$\mathbb{P}\left[\frac{1}{M} \sum_{i=1}^{M} \mathbb{1}[h_i \in V(h|_X)] \leq \frac{1}{2}\right] \leq \exp(-M/72) \leq \frac{\delta}{2}.$$

Therefore, with probability $1 - \delta/2$ the majority vote over the drawn hypotheses provides the
correct label for all $x \in X$. In total, the correct labeling is returned with probability $1 - \delta$.



Appendix B. Analysis of the Preprocessing Procedure

We now prove Theorem 3 by showing that Alg. 2 satisfies the claims of the theorem. It is
clear that Alg. 2 is polynomial as required in item (3). In addition, item (1) holds from the
definition of Alg. 2. We have left to prove item (2). We first prove that it holds for the case
where the input is represented directly as $X \subseteq \mathbb{R}^d$.

We start by showing that under the assumption of Theorem 3, the set $\{x'_1, \ldots, x'_m\}$,
which is generated in step 8, is separated with a bounded margin by the original labels of
$x_i$. Fix $\gamma > 0$ and $w^* \in \mathbb{B}_1^d$. For each $i \in [m]$, define

$$\ell_i = \max(0, \gamma - L(i)\langle w^*, x_i \rangle).$$

Thus, $\ell_i$ quantifies the margin violation of example $x_i$ by $w^*$, relative to its true label $L(i)$.

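Lemma 16 If $H \geq \sum_{i=1}^{m} \ell_i^2$, where $H$ is the input to Alg. 2, then there is a $w \in \mathbb{B}_1^{d+m}$ such
that for all $i \in [m]$, $L(i)\langle w, x'_i \rangle \geq \frac{\gamma}{1 + \sqrt{H}}$.

Proof By step 8 in Alg. 2, $x'_i = (a \cdot x_i; \sqrt{1 - a^2} \cdot e_i)$, where $a = \sqrt{\frac{1}{1 + \sqrt{H}}}$. Define

$$w' = \left(w^*;\ \frac{a}{\sqrt{1 - a^2}} (L(1)\ell_1, \ldots, L(m)\ell_m)\right).$$

Then

$$L(i)\langle w', x'_i \rangle = a L(i)\langle w^*, x_i \rangle + a \ell_i \geq a(\gamma - \ell_i) + a \ell_i = a\gamma.$$

Let $w = \frac{w'}{\|w'\|}$. Then $w \in \mathbb{B}_1^{d+m}$, and

$$L(i)\langle w, x'_i \rangle = \frac{L(i)\langle w', x'_i \rangle}{\|w'\|} \geq \frac{a\gamma}{\sqrt{1 + \frac{a^2}{1 - a^2} \sum_{i=1}^{m} \ell_i^2}}.$$

Set $a^2 = \frac{1}{1 + \sqrt{H}}$ and assume $H \geq \sum_{i=1}^{m} \ell_i^2$. Then $\frac{a^2}{1 - a^2} = \frac{1}{\sqrt{H}}$, so the denominator is at
most $\sqrt{1 + \sqrt{H}}$, and

$$L(i)\langle w, x'_i \rangle \geq \frac{\gamma}{1 + \sqrt{H}}.$$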



The set $\{\bar{x}_1, \ldots, \bar{x}_m\}$ returned by Alg. 2 is a Johnson-Lindenstrauss projection of
$\{x'_1, \ldots, x'_m\}$ on $\mathbb{R}^k$. It is known (see e.g. Balcan et al. (2006b)) that if a set of $m$ points is
separable with margin $\eta$ and $k \geq O\left(\frac{\ln(m/\delta)}{\eta^2}\right)$, then with probability $1 - \delta$, the projected points are
separable with margin $\eta/2$. Setting $\eta = \frac{\gamma}{1 + \sqrt{H}}$, it is easy to see that step 12 in Alg. 2 indeed
maintains the desired margin. This completes the proof of item (2) of Theorem 3 for the
case where the input is $X \subseteq \mathbb{R}^d$.

We now show that if the input is a kernel matrix $K$, then the decomposition step 3
preserves the separation properties of the input data, thus showing that item (2) holds in
this case as well. To show that our decomposition step does not change the properties of the
original data, we first use the following lemma, which indicates that separation properties
are conserved under different decompositions of the same kernel matrix.

Lemma 17 (Sabato et al. (2010), Lemma 6.3) Let $K \in \mathbb{R}^{m \times m}$ be an invertible PSD
matrix and let $V \in \mathbb{R}^{m \times n}$, $U \in \mathbb{R}^{m \times k}$ be matrices such that $K = VV^T = UU^T$. For any
vector $w \in \mathbb{R}^n$ there exists a vector $u \in \mathbb{R}^k$ such that $Vw = Uu$ and $\|u\| \leq \|w\|$.

The next lemma extends the above result, showing that the property holds even if $K$ is
not invertible.

Lemma 18 Let $K \in \mathbb{R}^{m \times m}$ be a PSD matrix and let $V \in \mathbb{R}^{m \times n}$, $U \in \mathbb{R}^{m \times k}$ be matrices
such that $K = VV^T = UU^T$. For any vector $w \in \mathbb{R}^n$ there exists a vector $u \in \mathbb{R}^k$ such that
$Vw = Uu$ and $\|u\| \leq \|w\|$.

Proof For a matrix $A$ and sets of indexes $I, J$, let $A[I]$ be the sub-matrix of $A$ whose rows
are the rows of $A$ with an index in $I$. Let $A[I, J]$ be the sub-matrix of $A$ whose rows are
those with an index in $I$ and whose columns are those with an index in $J$.

If $K$ is invertible, the claim holds by Lemma 17. Thus, assume $K$ is singular. Let $I \subseteq [m]$
be a maximal subset such that the matrix $K[I, I]$ is invertible. If no such subset exists
then $K, V, U$ are all zero and the claim is trivial. Since $K[I, I] = V[I](V[I])^T = U[I](U[I])^T$,
by Lemma 17 there exists a vector $u$ such that $V[I]w = U[I]u$ and $\|u\| \leq \|w\|$. We will
show that for any $i \notin I$, $V[i]w = U[i]u$ as well.

For any $i \notin I$, $K[I \cup \{i\}, I \cup \{i\}]$ is singular. Therefore the rows of $V[I \cup \{i\}]$ are linearly
dependent, while the rows of $V[I]$ are not. Thus there is some vector $\lambda \in \mathbb{R}^{|I|}$ such that
$V[i]^T = V[I]^T \lambda$. By a similar argument there is some vector $\eta \in \mathbb{R}^{|I|}$ such that
$U[i]^T = U[I]^T \eta$. We have $K[I, i] = V[I]V[i]^T = V[I]V[I]^T \lambda = K[I, I]\lambda$. Similarly for $U$,
$K[I, i] = K[I, I]\eta$. Therefore $K[I, I](\lambda - \eta) = 0$. Since $K[I, I]$ is invertible, it follows that
$\lambda = \eta$. Therefore, $U[i]u = \eta^T U[I]u = \lambda^T V[I]w = V[i]w$.
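
Lemma 18 can be checked numerically on a singular kernel matrix. The sketch below is an illustration rather than part of the proof: the helper names are hypothetical, and the vector $u$ is obtained as the minimum-norm least-squares solution of $Uu = Vw$, which is one concrete choice satisfying the lemma.

```python
import numpy as np

def psd_factor(K):
    """One factor U with K = U U^T, from the eigendecomposition of K."""
    vals, vecs = np.linalg.eigh(K)
    return vecs * np.sqrt(np.clip(vals, 0.0, None))

def match_decomposition(U, V, w):
    """Minimum-norm u with U u = V w, for two factors of the same PSD matrix."""
    # rcond discards the near-zero singular values that come from round-off.
    u, *_ = np.linalg.lstsq(U, V @ w, rcond=1e-6)
    return u
```

Because $Vw$ lies in the column space of $K$, which is also the column space of $U$, the system is solvable exactly, and the minimum-norm solution has norm at most $\|w\|$.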

We now use this lemma to show that the decomposition step does not change the upper 
bound on the margin loss which is assumed in Theorem [3] 
Theorem 19 Let $\psi_1, \ldots, \psi_m$ be a set of vectors in a Hilbert space $S$, and let $K \in \mathbb{R}^{m \times m}$ such that for all $i, j \in [m]$, $K_{ij} = \langle \psi_i, \psi_j \rangle$. Suppose there exists a $w \in S$ with $\|w\| \le 1$ such that

$$H \ge \sum_{i=1}^{m} \max(0, \gamma - y_i \langle w, \psi_i \rangle)^2. \tag{15}$$

Let $U \in \mathbb{R}^{m \times k}$ such that $K = UU^T$ and let $x_i$ be row $i$ of $U$. Then there exists a $u \in \mathbb{B}_1^k$ such that

$$H \ge \sum_{i=1}^{m} \max(0, \gamma - y_i \langle u, x_i \rangle)^2. \tag{16}$$

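Proof Let $\alpha_1, \ldots, \alpha_n \in S$ be an orthonormal basis for the span of $\psi_1, \ldots, \psi_m$ and $w$, and let $v_1, \ldots, v_m, v_w \in \mathbb{R}^n$ such that $\sum_{l=1}^{n} v_i[l]\alpha_l = \psi_i$ and $\sum_{l=1}^{n} v_w[l]\alpha_l = w$. Let $V \in \mathbb{R}^{m \times n}$ be a matrix such that row $i$ of the matrix is $v_i$. Then $K = VV^T$, and $Vv_w = r$, where $r[i] = \langle v_w, v_i \rangle = \langle w, \psi_i \rangle$. By Lemma 18, there exists a $u \in \mathbb{R}^k$ such that $Uu = r$ and $\|u\| \le \|v_w\|$. Then we have $\langle u, x_i \rangle = r[i]$. Therefore for all $i \in [m]$, $\langle w, \psi_i \rangle = \langle u, x_i \rangle$, thus Equation (15) implies Equation (16). In addition, $\|u\| \le \|v_w\| = \|w\| \le 1$, therefore $u \in \mathbb{B}_1^k$. ■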
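
As an illustration of Theorem 19, the sketch below decomposes a kernel matrix and checks that the rows of $U$ reproduce the margins of a predictor in the Hilbert space. It rests on several assumptions that are not taken from the paper: a Gaussian kernel on random points, an eigendecomposition as the decomposition step (the helper `decompose_kernel` is hypothetical), and a predictor restricted to the form $w = \sum_i a_i \psi_i$, for which $\langle w, \psi_i \rangle = (Ka)_i$ and $\|w\|^2 = a^T K a$.

```python
import numpy as np

def decompose_kernel(K, tol=1e-10):
    """Return U with K ~= U @ U.T, via an eigendecomposition of the PSD matrix K."""
    eigvals, eigvecs = np.linalg.eigh(K)
    keep = eigvals > tol * max(eigvals.max(), 1.0)  # drop numerically zero directions
    return eigvecs[:, keep] * np.sqrt(eigvals[keep])

rng = np.random.default_rng(1)
points = rng.standard_normal((8, 2))                # hypothetical pool of 8 examples
sq_dists = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-0.5 * sq_dists)                         # Gaussian kernel matrix

U = decompose_kernel(K)                             # row i of U plays the role of x_i
gram_matches = np.allclose(U @ U.T, K, atol=1e-8)

# A predictor w = sum_i a_i psi_i, scaled so that ||w|| = 1.
a = rng.standard_normal(8)
a /= np.sqrt(a @ K @ a)
margins_w = K @ a                                   # <w, psi_i> for each i

# For this special form of w, u = U^T a satisfies U u = K a.
u = U.T @ a
margins_u = U @ u                                   # <u, x_i> for each i

margins_match = np.allclose(margins_w, margins_u, atol=1e-8)
u_in_unit_ball = np.linalg.norm(u) <= 1.0 + 1e-8
print(gram_matches, margins_match, u_in_unit_ball)
```

Since the margins coincide, any loss that depends on the data only through these margins takes the same value for $w$ and for $u$.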



Appendix C. Analysis of the Simpler Implementation 

The following theorem shows that using the estimation procedure listed in Alg. 4 also results
in an approximately greedy policy, as does the original implementation of ALuMA. Similarly
to the proof of Corollary 15, it follows that Theorem 2 holds for this implementation as
well.

Theorem 20 If for each iteration $t$ of the algorithm, the greedy choice $x^*$ satisfies

$$\forall j \in \{-1, +1\}, \quad \mathbb{P}[h(x^*) = j \mid h \in V_t] \ge 4\sqrt{\lambda},$$

then ALuMA with the estimation procedure implements a 2-approximate greedy policy. 
Moreover, it is possible to efficiently verify that this condition holds while running the algo- 
rithm. 

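Proof Fix the iteration $t$, and denote $p_{x,1} = \mathbb{P}(V_{t,x}^{1})/\mathbb{P}(V_t)$ and $p_{x,-1} = \mathbb{P}(V_{t,x}^{-1})/\mathbb{P}(V_t)$.
Note that $p_{x,1} + p_{x,-1} = 1$. Since $h_1, \ldots, h_k$ are sampled $\lambda$-uniformly from the version
space, we have

$$\forall i \in [k], \quad |\mathbb{P}[h_i \in V_{t,x}^{j}] - p_{x,j}| \le \lambda. \tag{17}$$

In addition, by Hoeffding's inequality and a union bound over the examples in the pool and
the iterations of the algorithm,

$$\mathbb{P}[\exists x, \; |v_{x,j} - \mathbb{P}[h_i \in V_{t,x}^{j}]| > \lambda] \le 2m \exp(-2k\lambda^2). \tag{18}$$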

From Alg. 4 we have $k = \frac{\ln(2m/\delta)}{2\lambda^2}$. Combining this with Equation (17) and Equation (18)
we get that

$$\mathbb{P}[\exists x, \; |v_{x,j} - p_{x,j}| > 2\lambda] \le \delta.$$
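
To get a feel for the sample sizes involved, the sketch below computes the smallest integer $k$ for which the right-hand side of Equation (18), $2m\exp(-2k\lambda^2)$, is at most $\delta$. The function names and the numeric values of $m$, $\lambda$ and $\delta$ are hypothetical and chosen only for illustration.

```python
import math

def hoeffding_bound(m, k, lam):
    """Right-hand side of Equation (18): 2 m exp(-2 k lam^2)."""
    return 2 * m * math.exp(-2 * k * lam ** 2)

def samples_needed(m, lam, delta):
    """Smallest integer k such that 2 m exp(-2 k lam^2) <= delta."""
    return math.ceil(math.log(2 * m / delta) / (2 * lam ** 2))

# Hypothetical values: pool-related size m, accuracy lam, confidence delta.
m, lam, delta = 1000, 0.05, 0.01
k = samples_needed(m, lam, delta)
print(k, hoeffding_bound(m, k, lam))
```

The required $k$ grows only logarithmically with $m/\delta$ but quadratically with $1/\lambda$, so the accuracy parameter dominates the sampling cost.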
The greedy choice for this iteration is

$$x^* \in \operatorname*{argmax}_{x \in X} \Delta(h|_X, x) = \operatorname*{argmax}_{x \in X} (p_{x,1}\, p_{x,-1}).$$



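By the assumption in the theorem, $p_{x^*,j} \ge 4\sqrt{\lambda}$ for $j \in \{-1, +1\}$. Since $\lambda \in (0, \frac{1}{64})$, we
have $\lambda \le \sqrt{\lambda}/8$. Therefore $p_{x^*,j} - 2\lambda \ge 4\sqrt{\lambda} - \sqrt{\lambda}/4 \ge \sqrt{10\lambda}$. Therefore

$$v_{x^*,1}\, v_{x^*,-1} \ge (p_{x^*,1} - 2\lambda)(p_{x^*,-1} - 2\lambda) \ge 10\lambda. \tag{19}$$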

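Let $\tilde{x} = \operatorname{argmax}_{x} (v_{x,-1}\, v_{x,+1})$ be the query selected by ALuMA using Alg. 4. Then

$$v_{x^*,-1}\, v_{x^*,+1} \le v_{\tilde{x},-1}\, v_{\tilde{x},+1} \le (p_{\tilde{x},1} + 2\lambda)(p_{\tilde{x},-1} + 2\lambda) \le p_{\tilde{x},1}\, p_{\tilde{x},-1} + 4\lambda,$$

where in the last inequality we used the facts that $p_{\tilde{x},1} + p_{\tilde{x},-1} = 1$ and $4\lambda^2 \le 2\lambda$. On the
other hand, by Equation (19),

$$v_{x^*,-1}\, v_{x^*,+1} \ge 5\lambda + \tfrac{1}{2} v_{x^*,-1}\, v_{x^*,+1} \ge 5\lambda + \tfrac{1}{2}(p_{x^*,1} - 2\lambda)(p_{x^*,-1} - 2\lambda) \ge 4\lambda + \tfrac{1}{2} p_{x^*,1}\, p_{x^*,-1}.$$

Combining the two inequalities for $v_{x^*,-1}\, v_{x^*,+1}$ it follows that $p_{\tilde{x},1}\, p_{\tilde{x},-1} \ge \tfrac{1}{2} p_{x^*,1}\, p_{x^*,-1}$.
Thus this is a 2-approximately greedy policy.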
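
The 2-approximation argument can be sanity-checked by simulation. The sketch below is a hypothetical setup, not the ALuMA implementation: it draws true probabilities $p_{x,+1}$ for a small pool, forces one example to satisfy the $4\sqrt{\lambda}$ condition (so the greedy choice satisfies it too), perturbs every probability by at most $2\lambda$ to mimic the estimates $v_{x,+1}$, selects the query with the largest estimated product, and records the worst ratio between the true product of the selected query and that of the greedy choice.

```python
import numpy as np

def select_query(v_plus):
    """Index maximising the estimated product v_{x,+1} * v_{x,-1}."""
    return int(np.argmax(v_plus * (1.0 - v_plus)))

rng = np.random.default_rng(2)
lam = 1e-3                      # hypothetical accuracy parameter, below 1/64
threshold = 4 * np.sqrt(lam)
worst_ratio = np.inf

for _ in range(2000):
    # True probabilities p_{x,+1} for a hypothetical pool of 20 examples.
    p_plus = rng.uniform(0.0, 1.0, size=20)
    # One example meets the assumption, hence so does the greedy choice,
    # which is at least as balanced as this example.
    p_plus[0] = rng.uniform(threshold, 1.0 - threshold)

    # Estimates within 2*lam of the truth; clipping keeps them in [0, 1]
    # and cannot increase the error because p_plus already lies in [0, 1].
    noise = rng.uniform(-2 * lam, 2 * lam, size=20)
    v_plus = np.clip(p_plus + noise, 0.0, 1.0)

    true_product = p_plus * (1.0 - p_plus)
    chosen = select_query(v_plus)
    greedy = int(np.argmax(true_product))
    worst_ratio = min(worst_ratio, true_product[chosen] / true_product[greedy])

print(worst_ratio)
```

In such runs the observed ratio is typically far above the guaranteed factor of one half, which only bounds the worst case.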

To verify that the assumption holds at each iteration of the algorithm, note that for all
$x = x_i$ such that $i \in I_t$,

$$p_{x,-1}\, p_{x,+1} \ge (v_{x,-1} - 2\lambda)(v_{x,+1} - 2\lambda) \ge v_{x,-1}\, v_{x,+1} - 2\lambda.$$

Therefore it suffices to check that for all $x = x_i$ such that $i \in I_t$,

$$v_{x,-1}\, v_{x,+1} \ge 4\sqrt{\lambda} + 2\lambda.$$






